This volume contains documents which supplement the information contained in Volume
1 of The UNIX Programmer's Manual. The documents here are grouped roughly into the areas
of basics, editing, language tools, document preparation, and system maintenance. Further
general information may be found in the Bell System Technical Journal special issue on UNIX,
July-August, 1978.

Many of the documents cited within this volume as Bell Laboratories internal memoranda 
or Computing Science Technical Reports (CSTR) are also contained here. 

These documents contain occasional localisms, typically references to other operating sys- 
tems like GCOS and IBM. In all cases, such references may be safely ignored by UNIX users. 

General Works 

1. 7th Edition UNIX — Summary. 

A concise summary of the facilities available on UNIX. 

2. The UNIX Time-Sharing System. D. M. Ritchie and K. Thompson. 

The original UNIX paper, reprinted from CACM. 

Getting Started 

3. UNIX for Beginners — Second Edition. B. W. Kernighan. 

An introduction to the most basic use of the system. 

4. A Tutorial Introduction to the UNIX Text Editor. B. W. Kernighan. 

An easy way to get started with the editor. 

5. Advanced Editing on UNIX. B. W. Kernighan. 

The next step. 

6. An Introduction to the UNIX Shell. S. R. Bourne. 

An introduction to the capabilities of the command interpreter, the shell. 

7. Learn - Computer Aided Instruction on UNIX. M. E. Lesk and B. W. Kernighan.

Describes a computer-aided instruction program that walks new users through the 
basics of files, the editor, and document preparation software. 

Document Preparation 

8. Typing Documents on the UNIX System. M. E. Lesk. 

Describes the basic use of the formatting tools. Also describes "-ms", a standardized
package of formatting requests that can be used to lay out most documents
(including those in this volume).

9. A System for Typesetting Mathematics. B. W. Kernighan and L. L. Cherry.

Describes EQN, an easy-to-learn language for doing high-quality 
mathematical typesetting. 

10. TBL - A Program to Format Tables. M. E. Lesk. 

A program to permit easy specification of tabular material for 
typesetting. Again, easy to learn and use. 

11. Some Applications of Inverted Indexes on the UNIX System. M. E. Lesk. 

Describes, among other things, the program REFER which fills in 
bibliographic citations from a data base automatically. 

12. NROFF/TROFF User's Manual. J. F. Ossanna.

The basic formatting program. 

13. A TROFF Tutorial. B. W. Kernighan.

An introduction to TROFF for those who really want to know such 
things. 

Programming 

14. The C Programming Language - Reference Manual. D. M. Ritchie. 

Official statement of the syntax and semantics of C. Should be sup- 
plemented by The C Programming Language, B. W. Kernighan and D. 
M. Ritchie, Prentice-Hall, 1978, which contains a tutorial introduction 
and many examples. 

15. Lint, A C Program Checker. S. C. Johnson. 

Checks C programs for syntax errors, type violations, portability 
problems, and a variety of probable errors.

16. Make - A Program for Maintaining Computer Programs. S. I. Feldman.

Indispensable tool for making sure that large programs are properly
compiled with minimal effort.

17. UNIX Programming. B. W. Kernighan and D. M. Ritchie.

Describes the programming interface to the operating system and
the standard I/O library.

18. A Tutorial Introduction to ADB. J. F. Maranzano and S. R. Bourne.

How to use the ADB Debugger.

Supporting Tools and Languages 

19. YACC: Yet Another Compiler-Compiler. S. C. Johnson.

Converts a BNF specification of a language and semantic actions writ- 
ten in C into a compiler of the language. 

20. LEX - A Lexical Analyzer Generator. M. E. Lesk and E. Schmidt.

Creates a recognizer for a set of regular expressions; each regular 
expression can be followed by arbitrary C code which will be executed 
when the regular expression is found. 

21. A Portable Fortran 77 Compiler. S. I. Feldman and P. J. Weinberger. 

The first Fortran 77 compiler, and still one of the best. NOTE: This 
document has been moved to Volume 2c of the UNIX Programmer's 
Manual. 

22. Ratfor - A Preprocessor for a Rational Fortran. B. W. Kernighan. 

Converts a Fortran with C-like control structures and cosmetics into 
real, ugly Fortran. 

23. The M4 Macro Processor. B. W. Kernighan and D. M. Ritchie. 

M4 is a macro processor useful as a front end for C, Ratfor, Cobol, and 
in its own right.

24. SED - A Non-interactive Text Editor. L. E. McMahon.

A variant of the editor for processing large inputs. 

25. AWK - A Pattern Scanning and Processing Language. A. V. Aho, B. W. Kernighan and
P. J. Weinberger.

Makes it easy to specify many data transformation and selection operations. 

26. DC — An Interactive Desk Calculator. R. H. Morris and L. L. Cherry. 

A super HP calculator, if you don't need floating point. 

27. BC — An Arbitrary Precision Desk-Calculator Language. L. L. Cherry and R. H. Morris. 

A front end for DC that provides infix notation, control flow, and built-in functions. 

28. UNIX Assembler Reference Manual. D. M. Ritchie. 

The ultimate dead language. 

Implementation, Maintenance, and Miscellaneous 

29. Setting Up UNIX - Seventh Edition. C. B. Haley and D. M. Ritchie. 

How to configure and get your system running. 

30. Regenerating System Software. C. B. Haley and D. M. Ritchie. 

What to do when you have to change things.

31. UNIX Implementation. K. Thompson. 

How the system actually works inside. 

32. The UNIX I/O System. D. M. Ritchie. 

How the I/O system really works. 

33. A Tour Through the UNIX C Compiler. D. M. Ritchie. 

How the PDP-11 compiler works inside. 

34. A Tour Through the Portable C Compiler. S. C. Johnson. 

How the portable C compiler works inside. 

35. A Dial-Up Network of UNIX Systems. D. A. Nowitz and M. E. Lesk. 

Describes UUCP, a program for communicating files between UNIX systems. 

36. UUCP Implementation Description. D. A. Nowitz. 

How UUCP works, and how to administer it. 

37. On the Security of UNIX. D. M. Ritchie. 

Hints on how to break UNIX, and how to avoid doing so. 

38. Password Security: A Case History. R. H. Morris and K. Thompson. 

How the bad guys used to be able to break the password algorithm, and why they 
can't now, at least not so easily.

ABSTRACT

This paper describes the second version of the learn program for interpreting
CAI scripts on the UNIX† operating system, and a set of scripts that provide
a computerized introduction to the system.

Six current scripts cover basic commands and file handling, the editor,
additional file handling commands, the eqn program for mathematical typing,
the "-ms" package of formatting macros, and an introduction to the C
programming language. These scripts now include a total of about 530 lessons.

Many users from a wide variety of backgrounds have used learn to 
acquire basic UNIX skills. Most usage involves the first two scripts, an introduc- 
tion to files and commands, and the text editor. 

The second version of learn is about four times faster than the previous 
one in CPU utilization, and much faster in perceived time because of better 
overlap of computing and printing. It also requires less file space than the first 
version. Many of the lessons have been revised; new material has been added 
to reflect changes and enhancements in the UNIX system itself. Script-writing is 
also easier because of revisions to the script language.

1. Introduction.

Learn is a driver for CAI scripts. It is intended to permit the easy composition of lessons 
and lesson fragments to teach people computer skills. Since it is teaching the same system on 
which it is implemented, it makes direct use of UNIX† facilities to create a controlled UNIX
environment. The system includes two main parts: (1) a driver that interprets the lesson 
scripts; and (2) the lesson scripts themselves. At present there are six scripts: 

- basic file handling commands
- the UNIX text editor ed
- advanced file handling
- the eqn language for typing mathematics
- the "-ms" macro package for document formatting
- the C programming language

The purported advantages of CAI scripts for training in computer skills include the following:

(a) students are forced to perform the exercises that are in fact the basis of training in
any case; 

(b) students receive immediate feedback and confirmation of progress; 

(c) students may progress at their own rate; 

(d) no schedule requirements are imposed; students may study at any time convenient
for them; 

(e) the lessons may be improved individually and the improvements are immediately 
available to new users; 

(f) since the student has access to a computer for the CAI script there is a place to do 
exercises; 

(g) the use of high technology will improve student motivation and the interest of their 
management. 

Opposed to this, of course, is the absence of anyone to whom the student may direct questions. 
If CAI is used without a "counselor" or other assistance, it should properly be compared to a 
textbook, lecture series, or taped course, rather than to a seminar. CAI has been used for 
many years in a variety of educational areas. [1, 2, 3] The use of a computer to teach itself,
however, offers unique advantages. The skills developed to get through the script are exactly those
needed to use the computer; there is no wasted effort.

The scripts written so far are based on some familiar assumptions about education; these
assumptions are outlined in the next section. The remaining sections describe the operation of
the script driver and the particular scripts now available. The driver puts few restrictions on the
script writer, but the current scripts are of a rather rigid and stereotyped form in accordance
with the theory in the next section and practical limitations.

†UNIX is a Trademark of Bell Laboratories.

2. Educational Assumptions and Design. 

First, the way to teach people how to do something is to have them do it. Scripts should 
not contain long pieces of explanation; they should instead frequently ask the student to do 
some task. So teaching is always by example: the typical script fragment shows a small example 
of some technique and then asks the user to either repeat that example or produce a variation 
on it. All are intended to be easy enough that most students will get most questions right, rein- 
forcing the desired behavior. 

Most lessons fall into one of three types. The simplest presents a lesson and asks for a 
yes or no answer to a question. The student is given a chance to experiment before replying. 
The script checks for the correct reply. Problems of this form are sparingly used. 

The second type asks for a word or number as an answer. For example, a lesson on files
might say

How many files are there in the current directory? Type "answer N", where N is the number
of files.

The student is expected to respond (perhaps after experimenting) with

answer 17

or whatever. Surprisingly often, however, the idea of a substitutable argument (i.e., replacing
N by 17) is difficult for non-programmer students, so the first few such lessons need real care.

The third type of lesson is open-ended — a task is set for the student, appropriate parts of 
the input or output are monitored, and the student types ready when the task is done. Figure 1 
shows a sample dialog that illustrates the last of these, using two lessons about the cat (con- 
catenate, i.e., print) command taken from early in the script that teaches file handling. Most 
learn lessons are of this form. 

After each correct response the computer congratulates the student and indicates the les- 
son number that has just been completed, permitting the student to restart the script after that 
lesson. If the answer is wrong, the student is offered a chance to repeat the lesson. The 
"speed" rating of the student (explained in section 5) is given after the lesson number when 
the lesson is completed successfully; it is printed only for the aid of script authors checking out 
possible errors in the lessons. 

It is assumed that there is no foolproof way to determine if the student truly "understands"
what he or she is doing; accordingly, the current learn scripts only measure performance,
not comprehension. If the student can perform a given task, that is deemed to be
"learning." [4]

The main point of using the computer is that what the student does is checked for 
correctness immediately. Unlike many CAI scripts, however, these scripts provide few facilities 
for dealing with wrong answers. In practice, if most of the answers are not right the script is a 
failure; the universal solution to student error is to provide a new, easier script. Anticipating 
possible wrong answers is an endless job, and it is really easier as well as better to provide a 
simpler script. 

Along with this goes the assumption that anything can be taught to anybody if it can be 
broken into sufficiently small pieces. Anything not absorbed in a single chunk is just subdi- 
vided. 

To avoid boring the faster students, however, an effort is made in the files and editor
scripts to provide three tracks of different difficulty. The fastest sequence of lessons is aimed at
roughly the bulk and speed of a typical tutorial manual and should be adequate for review and
for well-prepared students. The next track is intended for most users and is roughly twice as
long. Typically, for example, the fast track might present an idea and ask for a variation on the
example shown; the normal track will first ask the student to repeat the example that was
shown before attempting a variation. The third and slowest track, which is often three or four
times the length of the fast track, is intended to be adequate for anyone. (The lessons of
Figure 1 are from the third track.) The multiple tracks also mean that a student repeating a course
is unlikely to hit the same series of lessons; this makes it profitable for a shaky user to back up
and try again, and many students have done so.

Figure 1: Sample dialog from basic files script

(Student responses in italics; '$' is the prompt)

A file can be printed on your terminal
by using the "cat" command. Just say
"cat file" where "file" is the file name.
For example, there is a file named
"food" in this directory. List it
by saying "cat food"; then type "ready".
$ cat food
this is the file
named food.
$ ready

Good. Lesson 3.3a (1)

Of course, you can print any file with "cat".
In particular, it is common to first use
"ls" to find the name of a file and then "cat"
to print it. Note the difference between
"ls", which tells you the name of the file,
and "cat", which tells you the contents.
One file in the current directory is named for
a President. Print the file, then type "ready".
$ cat President
cat: can't open President
$ ready

Sorry, that's not right. Do you want to try again? yes
Try the problem again.
$ ls
.ocopy
X1
roosevelt
$ cat roosevelt
this file is named roosevelt
and contains three lines of
text.
$ ready

Good. Lesson 3.3b (0)

The "cat" command can also print several files
at once. In fact, it is named "cat" as an abbreviation
for "concatenate"....

The tracks are not completely distinct, however. Depending on the number of correct 
answers the student has given for the last few lessons, the program may switch tracks. The 
driver is actually capable of following an arbitrary directed graph of lesson sequences, as
discussed in section 5. Some more structured arrangement, however, is used in all current scripts
to aid the script writer in organizing the material into lessons. It is sufficiently difficult to write 
lessons that the three-track theory is not followed very closely except in the files and editor 
scripts. Accordingly, in some cases, the fast track is produced merely by skipping lessons from 
the slower track. In others, there is essentially only one track. 
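
The idea of switching tracks on the basis of recent answers can be sketched as follows. This is only an illustration: the window size, the thresholds, the track names, and the TrackChooser class are all assumptions made for the example, not the rule the learn driver actually applies (that mechanism is described in section 5).

```python
from collections import deque

# Hypothetical track names, ordered from slowest to fastest.
TRACKS = ["slow", "normal", "fast"]


class TrackChooser:
    """Pick a track from the student's last few results (illustrative policy)."""

    def __init__(self, window=5, start="normal"):
        self.recent = deque(maxlen=window)  # True for a pass, False for a failure
        self.track = start

    def record(self, correct):
        """Record one lesson result and return the track for the next lesson."""
        self.recent.append(bool(correct))
        if len(self.recent) < self.recent.maxlen:
            return self.track  # not enough history yet
        right = sum(self.recent)
        i = TRACKS.index(self.track)
        if right == len(self.recent) and i < len(TRACKS) - 1:
            # Every recent answer was right: move to a faster track.
            self.track = TRACKS[i + 1]
            self.recent.clear()
        elif right <= len(self.recent) // 2 and i > 0:
            # Half or fewer were right: move to a slower track.
            self.track = TRACKS[i - 1]
            self.recent.clear()
        return self.track


chooser = TrackChooser(window=3)
for result in (True, True, True):
    track = chooser.record(result)
print(track)
```

Clearing the history after each switch is a design choice of this sketch; it keeps one good or bad streak from moving the student more than one track at a time.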

The main reason for using the learn program rather than simply writing the same material 
as a workbook is not the selection of tracks, but actual hands-on experience. Learning by doing 
is much more effective than pencil and paper exercises. 

Learn also provides a mechanical check on performance. The first version in fact would 
not let the student proceed unless it received correct answers to the questions it set and it 
would not tell a student the right answer. This somewhat Draconian approach has been 
moderated in version 2. Lessons are sometimes badly worded or even just plain wrong: in such 
cases, the student has no recourse. But if a student is simply unable to complete one lesson, 
that should not prevent access to the rest. Accordingly, the current version of learn allows the 
student to skip a lesson that he cannot pass; a "no" answer to the "Do you want to try again?" 
question in Figure 1 will pass to the next lesson. It is still true that learn will not tell the stu- 
dent the right answer. 

Of course, there are valid objections to the assumptions above. In particular, some stu- 
dents may object to not understanding what they are doing; and the procedure of smashing 
everything into small pieces may provoke the retort "you can't cross a ditch in two jumps." 
Since writing CAI scripts is considerably more tedious than ordinary manuals, however, it is 
safe to assume that there will always be alternatives to the scripts as a way of learning. In fact,
for a reference manual of 3 or 4 pages it would not be surprising to have a tutorial manual of 
20 pages and a (multi-track) script of 100 pages. Thus the reference manual will exist long 
before the scripts. 

3. Scripts. 

As mentioned above, the present scripts try at most to follow a three-track theory. Thus 
little of the potential complexity of the possible directed graph is employed, since care must be 
taken in lesson construction to see that every necessary fact is presented in every possible path 
through the units. In addition, it is desirable that every unit have alternate successors to deal 
with student errors. 

In most existing courses, the first few lessons are devoted to checking prerequisites. For 
example, before the student is allowed to proceed through the editor script the script verifies 
that the student understands files and is able to type. It is felt that the sooner lack of student 
preparation is detected, the easier it will be on the student. Anyone proceeding through the 
scripts should be getting mostly correct answers; otherwise, the system will be unsatisfactory 
both because the wrong habits are being learned and because the scripts make little effort to 
deal with wrong answers. Unprepared students should not be encouraged to continue with 
scripts. 

There are some preliminary items which the student must know before any scripts can be 
tried. In particular, the student must know how to connect to a UNIX system, set the terminal 
properly, log in, and execute simple commands (e.g., learn itself). In addition, the character 
erase and line kill conventions (# and @) should be known. It is hard to see how this much 
could be taught by computer-aided instruction, since a student who does not know these basic 
skills will not be able to run the learning program. A brief description on paper is provided 
(see Appendix A), although assistance will be needed for the first few minutes. This
assistance, however, need not be highly skilled.

The first script in the current set deals with files. It assumes the basic knowledge above
and teaches the student about the ls, cat, mv, rm, cp and diff commands. It also deals with
the abbreviation characters *, ?, and [ ] in file names. It does not cover pipes or I/O
redirection, nor does it present the many options on the ls command.

This script contains 31 lessons in the fast track; two are intended as prerequisite checks, 
seven are review exercises. There are a total of 75 lessons in all three tracks, and the instruc- 
tional passages typed at the student to begin each lesson total 4,476 words. The average lesson 
thus begins with a 60-word message. In general, the fast track lessons have somewhat longer 
introductions, and the slow tracks somewhat shorter ones. The longest message is 144 words 
and the shortest 14. 

The second script trains students in the use of the context editor ed, a sophisticated editor 
using regular expressions for searching. [5] All editor features except encryption, mark names and
';' in addressing are covered. The fast track contains 2 prerequisite checks, 93 lessons, and a
review lesson. It is supplemented by 146 additional lessons in other tracks. 

A comparison of sizes may be of interest. The ed description in the reference manual is
2,572 words long. The ed tutorial [6] is 6,138 words long. The fast track through the ed script is
7,407 words of explanatory messages, and the total ed script, 242 lessons, has 15,615 words.
The average ed lesson is thus also about 60 words; the largest is 171 words and the smallest 10.
The original ed script represents about three man-weeks of effort.
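
The averages quoted for the files and ed scripts can be checked with a few lines of arithmetic. The sketch below uses only the word and lesson counts given in the text; the function and variable names are made up for the example.

```python
def average_words(total_words, lessons):
    """Average length of the opening message, in words per lesson."""
    return total_words / lessons


# Files script: 4,476 words over 75 lessons in all three tracks.
files_avg = average_words(4476, 75)

# ed script: 2 prerequisite checks + 93 lessons + 1 review lesson in the
# fast track, plus 146 lessons in the other tracks.
ed_lessons = 2 + 93 + 1 + 146
ed_avg = average_words(15615, ed_lessons)

print(round(files_avg, 1), ed_lessons, round(ed_avg, 1))  # 59.7 242 64.5
```

Both averages round to the "about 60 words" quoted, and the ed lesson count agrees with the 242 lessons stated for the whole script.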

The advanced file handling script deals with ls options, I/O diversion, pipes, and supporting
programs like pr, wc, tail, spell and grep. (The basic file handling script is a prerequisite.)
It is not as refined as the first two scripts; this is reflected at least partly in the fact that it
provides much less of a full three-track sequence than they do. On the other hand, since it is
perceived as "advanced," it is hoped that the student will have somewhat more sophistication and
be better able to cope with it at a reasonably high level of performance.

A fourth script covers the eqn language for typing mathematics. This script must be run 
on a terminal capable of printing mathematics, for instance the DASI 300 and similar Diablo- 
based terminals, or the nearly extinct Model 37 teletype. Again, this script is relatively short of 
tracks: of 76 lessons, only 17 are in the second track and 2 in the third track. Most of these 
provide additional practice for students who are having trouble in the first track. 

The -ms script for formatting macros is a short one-track only script. The macro package
it describes is no longer the standard, so this script will undoubtedly be superseded in the
future. Furthermore, the linear style of a single learn script is somewhat inappropriate for the 
macros, since the macro package is composed of many independent features, and few users 
need all of them. It would be better to have a selection of short lesson sequences dealing with 
the features independently. 

The script on C is in a state of transition. It was originally designed to follow a tutorial on 
C, but that document has since become obsolete. The current script has been partially
converted to follow the order of presentation in The C Programming Language, [7] but this job is not
complete. The C script was never intended to teach C; rather it is supposed to be a series of 
exercises for which the computer provides checking and (upon success) a suggested solution. 

This combination of scripts covers much of the material which any user will need to know 
to make effective use of the UNIX system. With enlargement of the advanced files course to 
include more on the command interpreter, there will be a relatively complete introduction to 
UNIX available via learn. Although we make no pretense that learn will replace other
instructional materials, it should provide a useful supplement to existing tutorials and
reference manuals.

4. Experience with Students.

Learn has been installed on many different UNIX systems. Most of the usage is on the
first two scripts, so these are more thoroughly debugged and polished. As a (random) sample 
of user experience, the learn program has been used at Bell Labs at Indian Hill for 10,500 les- 
sons in a four month period. About 3600 of these are in the files script, 4100 in the editor, and 
1400 in advanced files. The passing rate is about 80%, that is, about 4 lessons are passed for 
every one failed. There have been 86 distinct users of the files script, and 58 of the editor. On 
our system at Murray Hill, there have been nearly 4000 lessons over four weeks that include 
Christmas and New Year. Users have ranged in age from six up. 

It is difficult to characterize typical sessions with the scripts; many instances exist of some- 
one doing one or two lessons and then logging out, as do instances of someone pausing in a 
script for twenty minutes or more. In the earlier version of learn, the average session in the
files course took 32 minutes and covered 23 lessons. The distribution is quite broad and 
skewed, however; the longest session was 130 minutes and there were five sessions shorter 
than five minutes. The average lesson took about 80 seconds. These numbers are roughly typ- 
ical for non-programmers; a UNIX expert can do the scripts at approximately 30 seconds per les- 
son, most of which is the system printing. 

At present, working through a section of the middle of the files script took about 1.4
seconds of processor time per lesson, and a system expert typing quickly took 15 seconds of 
real time per lesson. A novice would probably take at least a minute. Thus, as a rough approx- 
imation, a UNIX system could support ten students working simultaneously with some spare 
capacity. 
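
The rough capacity estimate above can be reproduced directly. The sketch assumes a single processor, treats the quoted 1.4 seconds as the whole processor cost of a lesson, and takes the "at least a minute" novice pace as exactly 60 seconds; printing time and other system load are ignored.

```python
CPU_SECONDS_PER_LESSON = 1.4      # processor time per lesson, from the text
NOVICE_SECONDS_PER_LESSON = 60.0  # assumed pace: one lesson per minute
STUDENTS = 10


def cpu_utilization(students, cpu_per_lesson, wall_per_lesson):
    """Fraction of one processor consumed by the given number of students."""
    return students * cpu_per_lesson / wall_per_lesson


load = cpu_utilization(STUDENTS, CPU_SECONDS_PER_LESSON, NOVICE_SECONDS_PER_LESSON)
print(f"{load:.0%}")  # 23%
```

Under these assumptions ten novices use a little under a quarter of the processor, which is consistent with the claim of "some spare capacity".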

5. The Script Interpreter. 

The learn program itself merely interprets scripts. It provides facilities for the script writer 
to capture student responses and their effects, and simplifies the job of passing control to and 
recovering control from the student. This section describes the operation and usage of the 
driver program, and indicates what is required to produce a new script. Readers only interested 
in the existing scripts may skip this section. 

The file structure used by learn is shown in Figure 2. There is one parent directory 
(named lib) containing the script data. Within this directory are subdirectories, one for each 
subject in which a course is available, one for logging (named log), and one in which user sub- 
directories are created (named play). The subject directory contains master copies of all les- 
sons, plus any supporting material for that subject. In a given subdirectory, each lesson is a 
single text file. Lessons are usually named systematically; the file that contains lesson n is 
called Ln. 

When learn is executed, it makes a private directory for the user to work in, within the 
learn portion of the file system. A fresh copy of all the files used in each lesson (mostly data 
for the student to operate upon) is made each time a student starts a lesson, so the script writer 
may assume that everything is reinitialized each time a lesson is entered. The student directory 
is deleted after each session; any permanent records must be kept elsewhere. 
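
The life cycle of the student's working directory can be sketched as follows. This is a minimal illustration, not the code of learn itself: the function names are hypothetical, and clearing the whole directory at the start of each lesson is an assumption about how "everything is reinitialized".

```python
import shutil
import tempfile
from pathlib import Path


def make_student_dir(play_root):
    """Create a private working directory under the play directory."""
    return Path(tempfile.mkdtemp(prefix="student", dir=play_root))


def start_lesson(workdir, lesson_files):
    """Give the lesson a fresh copy of its files (name -> contents)."""
    for entry in workdir.iterdir():
        # Assumption: anything left over from the previous lesson is discarded.
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    for name, text in lesson_files.items():
        (workdir / name).write_text(text)


def end_session(workdir):
    """Delete the working directory; permanent records must live elsewhere."""
    shutil.rmtree(workdir)


play = Path(tempfile.mkdtemp())  # stand-in for the real play directory
work = make_student_dir(play)
start_lesson(work, {"food": "this is the file\nnamed food.\n"})
print(sorted(p.name for p in work.iterdir()))
end_session(work)
shutil.rmtree(play)
```

Because the files are rewritten at every entry to a lesson, a student who damages or deletes them can simply repeat the lesson.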

The script writer must provide certain basic items in each lesson: 

(1) the text of the lesson; 

(2) the set-up commands to be executed before the user gets control; 

(3) the data, if any, which the user is supposed to edit, transform, or otherwise process; 

(4) the evaluating commands to be executed after the user has finished the lesson, to decide 
whether the answer is right; and 

(5) a list of possible successor lessons. 

Learn tries to minimize the work of bookkeeping and installation, so that most of the effort 
involved in script production is in planning lessons, writing tutorial paragraphs, and coding tests 
of student performance.

Figure 2: Directory structure for learn

lib
    play
        student1
            files for student1...
        student2
            files for student2...
    files
        L0.1a    lessons for files course
        L0.1b
    editor
    (other courses)
    log

The basic sequence of events is as follows. First, learn creates the working directory. 
Then, for each lesson, learn reads the script for the lesson and processes it a line at a time. 
The lines in the script are: (1) commands to the script interpreter to print something, to create
a file, to test something, etc.; (2) text to be printed or put in a file; (3) other lines, which are
sent to the shell to be executed. One line in each lesson turns control over to the user; the
user can run any UNIX commands. The user mode terminates when the user types yes, no,
ready, or answer. At this point, the user's work is tested; if the lesson is passed, a new lesson 
is selected, and if not the old one is repeated. 
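
The three kinds of script line can be told apart with a small state machine. The sketch below is an assumption-laden simplification: it treats only #print and #create as commands that are followed by text, and it ignores the data lines that follow commands such as #next or a successful #match.

```python
# Commands whose following lines are text rather than shell input
# (a simplification made for this sketch).
TEXT_COMMANDS = {"#print", "#create"}


def classify(script_lines):
    """Return (kind, line) pairs; kind is "command", "text" or "shell"."""
    result = []
    in_text = False
    for line in script_lines:
        if line.startswith("#"):
            # A '#' line is a command to the interpreter and ends any text.
            in_text = line.split()[0] in TEXT_COMMANDS
            result.append(("command", line))
        elif in_text:
            result.append(("text", line))
        elif line.strip():
            # Anything else that is not blank is handed to the shell.
            result.append(("shell", line))
    return result


lesson = [
    "#create roosevelt",
    "this file is named roosevelt",
    "#copyout",
    "#user",
    "#uncopyout",
    "tail -3 .ocopy >X1",
    "#cmp X1 roosevelt",
]
for kind, line in classify(lesson):
    print(f"{kind:8} {line}")
```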

Let us illustrate this with the script for the second lesson of Figure 1; this is shown in
Figure 3. 

Lines which begin with # are commands to the learn script interpreter. For example, 
#print 

causes printing of any text that follows, up to the next line that begins with a sharp. 
#print file 

prints the contents of file; it is the same as cat file but has less overhead. Both forms of #print
have the added property that if a lesson is failed, the #print will not be executed the second 
time through; this avoids annoying the student by repeating the preamble to a lesson. 

#create filename 

creates a file of the specified name, and copies any subsequent text up to a # to the file. This 
is used for creating and initializing working files and reference data for the lessons. 

#user 

gives control to the student; each line he or she types is passed to the shell for execution. The 
#user mode is terminated when the student types one of yes, no, ready or answer. At that 
time, the driver resumes interpretation of the script. 

#copyin 
#uncopyin 

Anything the student types between these commands is copied onto a file called .copy. This lets 
the script writer interrogate the student's responses upon regaining control.

Figure 3: Sample Lesson

#print
Of course, you can print any file with "cat".
In particular, it is common to first use
"ls" to find the name of a file and then "cat"
to print it. Note the difference between
"ls", which tells you the name of the files,
and "cat", which tells you the contents.
One file in the current directory is named for
a President. Print the file, then type "ready".
#create roosevelt
this file is named roosevelt
and contains three lines of
text.
#copyout
#user
#uncopyout
tail -3 .ocopy >X1
#cmp X1 roosevelt
#log
#next
3.2b 2



#copyout
#uncopyout 

Between these commands, any material typed at the student by any program is copied to the file 
.ocopy. This lets the script writer interrogate the effect of what the student typed, which true 
believers in the performance theory of learning usually prefer to the student's actual input. 
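
The test in Figure 3 takes the last three lines of the captured output and compares them with the reference file. A rough Python equivalent of that check is sketched below; the function names are invented for the example, and the strings stand in for the contents of .ocopy and roosevelt.

```python
def last_lines(text, n):
    """Return the last n lines of text, in the manner of tail."""
    return text.splitlines()[-n:]


def output_matches(captured_output, reference, n=3):
    """True if the final n lines shown to the student equal the reference."""
    return last_lines(captured_output, n) == reference.splitlines()


reference = "this file is named roosevelt\nand contains three lines of\ntext.\n"

# Output captured during the failed and the successful attempts of Figure 1.
failed_attempt = "cat: can't open President\n"
good_attempt = ".ocopy\nX1\nroosevelt\n" + reference

print(output_matches(failed_attempt, reference))  # False
print(output_matches(good_attempt, reference))    # True
```

Checking the output rather than the input means the lesson is passed however the student got the file printed, which is the point of the performance-based approach.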

#pipe 
#unpipe 

Normally the student input and the script commands are fed to the UNIX command interpreter 
(the "shell") one line at a time. This won't do if, for example, a sequence of editor commands 
is provided, since the input to the editor must be handed to the editor, not to the shell. 
Accordingly, the material between #pipe and #unpipe commands is fed continuously through a 
pipe so that such sequences work. If copyout is also desired the copyout brackets must include 
the pipe brackets. 

There are several commands for setting status after the student has attempted the lesson. 
#cmp file1 file2

is an in-line implementation of cmp, which compares two files for identity. 
#match stuff 

The last line of the student's input is compared to stuff, and the success or fail status is set 
according to it. Extraneous things like the word answer are stripped before the comparison is 
made. There may be several #match lines; this provides a convenient mechanism for handling 
multiple "right" answers. Any text up to a # on subsequent lines after a successful #match is 
printed; this is illustrated in Figure 4, another sample lesson. 
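
The comparison that #match performs might be sketched in C roughly as follows. This is only an illustration: the function name match_answer and the exact stripping rule (a leading word "answer" plus surrounding blanks and the trailing newline) are assumptions, since the text says only that extraneous things like the word answer are stripped.

```c
#include <string.h>

/* Hypothetical helper: skip leading blanks and tabs. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/*
 * Return 1 if the student's last input line matches `stuff`, else 0.
 * Assumed stripping rule: a leading word "answer", surrounding blanks,
 * and a trailing newline are ignored.
 */
int match_answer(const char *line, const char *stuff)
{
    size_t n;

    line = skip_blanks(line);
    if (strncmp(line, "answer", 6) == 0 &&
        (line[6] == ' ' || line[6] == '\t' || line[6] == '\n' || line[6] == '\0'))
        line = skip_blanks(line + 6);

    /* Drop trailing newline and blanks before comparing. */
    n = strlen(line);
    while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == ' ' || line[n - 1] == '\t'))
        n--;

    return strlen(stuff) == n && strncmp(line, stuff, n) == 0;
}
```

Several #match lines would then correspond to calling such a function once per pattern and stopping at the first success.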

#bad stuff

This is similar to #match, except that it corresponds to specific failure answers; this can be
used to produce hints for particular wrong answers that have been anticipated by the script
writer.

Figure 4: Another Sample Lesson

#print
What command will move the current line
to the end of the file? Type
"answer COMMAND", where COMMAND is the command.
#copyin
#user
#uncopyin
#match m$
#match .m$
"m$" is easier.
#log
#next
63.1d 10


#succeed
#fail 

print a message upon success or failure (as determined by some previous mechanism). 

When the student types one of the "commands" yes, no, ready, or answer, the driver
terminates the #user command, and evaluation of the student's work can begin. This can be 
done either by the built-in commands above, such as #match and #cmp, or by status returned 
by normal UNIX commands, typically grep and test. The last command should return status true 
(0) if the task was done successfully and false (non-zero) otherwise; this status return tells the 
driver whether or not the student has successfully passed the lesson. 

Performance can be logged: 

#log file

writes the date, lesson, user name and speed rating, and a success/failure indication on file.
The command

#log

by itself writes the logging information in the logging directory within the learn hierarchy, and
is the normal form.

#next 

is followed by a few lines, each with a successor lesson name and an optional speed rating on it. 
A typical set might read 

25.1a 10 
25.2a 5 
25.3a 2 

indicating that unit 25.1a is a suitable follow-on lesson for students with a speed rating of 10
units, 25.2a for students with speed near 5, and 25.3a for speed near 2. Speed ratings are
maintained for each session with a student; the rating is increased by one each time the student
gets a lesson right and decreased by four each time the student gets a lesson wrong. Thus the
driver tries to maintain a level such that the users get 80% right answers. The maximum rating is
limited to 10 and the minimum to 0. The initial rating is zero unless the student specifies a
different rating when starting a session.
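
The rating rule and the choice of a successor lesson can be sketched in C as below. The names update_rating, pick_next and struct successor are invented for this illustration, and the "closest speed wins" selection rule is an assumption; the text says only that a lesson suits students whose speed is near the listed value.

```c
#include <stdlib.h>

#define MAX_RATING 10
#define MIN_RATING 0

/* Rule from the text: +1 for a pass, -4 for a failure, limited to 0..10. */
int update_rating(int rating, int passed)
{
    rating += passed ? 1 : -4;
    if (rating > MAX_RATING) rating = MAX_RATING;
    if (rating < MIN_RATING) rating = MIN_RATING;
    return rating;
}

/* One line following #next: a lesson name and its speed rating. */
struct successor {
    const char *lesson;
    int speed;
};

/*
 * Assumed selection rule: pick the successor whose speed is closest to
 * the student's rating; ties go to the earlier entry.
 * Returns NULL if there are no successors.
 */
const char *pick_next(const struct successor *s, int n, int rating)
{
    const char *best = NULL;
    int best_gap = 0;
    int i;

    for (i = 0; i < n; i++) {
        int gap = abs(s[i].speed - rating);
        if (best == NULL || gap < best_gap) {
            best = s[i].lesson;
            best_gap = gap;
        }
    }
    return best;
}
```

The 80% figure follows from the step sizes: if a fraction p of lessons is passed, the expected change per lesson is p - 4(1 - p) = 5p - 4, which is zero when p = 0.8.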

If the student passes a lesson, a new lesson is selected and the process repeats. If the
student fails, a false status is returned and the program reverts to the previous lesson and tries
another alternative. If it cannot find another alternative, it skips forward a lesson. The
student can terminate a session at any time by typing bye, which causes a graceful exit from learn.
Hanging up is the usual novice's way out.

The lessons may form an arbitrary directed graph, although the present program imposes 
a limitation on cycles in that it will not present a lesson twice in the same session. If the stu- 
dent is unable to answer one of the exercises correctly, the driver searches for a previous lesson 
with a set of alternatives as successors (following the #next line). From the previous lesson 
with alternatives one route was taken earlier; the program simply tries a different one. 

It is perfectly possible to write sophisticated scripts that evaluate the student's speed of 
response, or try to estimate the elegance of the answer, or provide detailed analysis of wrong 
answers. Lesson writing is so tedious already, however, that most of these abilities are likely to 
go unused. 

The driver program depends heavily on features of the UNIX system that are not available
on many other operating systems. These include the ease of manipulating files and directories,
file redirection, the ability to use the command interpreter as just another program (even in a
pipeline), command status testing and branching, the ability to catch signals like interrupts, and
of course the pipeline mechanism itself. Although some parts of learn might be transferable to
other systems, some generality will probably be lost.

A bit of history: The first version of learn had fewer built-in commands in the driver pro- 
gram, and made more use of the facilities of the UNIX system itself. For example, file com- 
parison was done by creating a cmp process, rather than comparing the two files within learn. 
Lessons were not stored as text files, but as archives. There was no concept of the in-line 
document; even #print had to be followed by a file name. Thus the initialization for each les- 
son was to extract the archive into the working directory (typically 4-8 files), then #print the 
lesson text. 

The combination of such things made learn rather slow and demanding of system 
resources. The new version is about 4 or 5 times faster, because fewer files and processes are 
created. Furthermore, it appears even faster to the user because in a typical lesson, the printing 
of the message comes first, and file setup with #create can be overlapped with printing, so that 
when the program finishes printing, it is really ready for the user to type at it. 

It is also a great advantage to the script maintainer that lessons are now just ordinary text 
files, rather than archives. They can be edited without any difficulty, and UNIX text manipula- 
tion tools can be applied to them. The result has been that there is much less resistance to 
going in and fixing substandard lessons. 

6. Conclusions 

The following observations can be made about secretaries, typists, and other
non-programmers who have used learn:

(a) A novice must have assistance with the mechanics of communicating with the computer 
to get through to the first lesson or two; once the first few lessons are passed people can 
proceed on their own. 

(b) The terminology used in the first few lessons is obscure to those inexperienced with
computers. It would help if there were a low-level reference card for UNIX to supplement the
existing programmer-oriented bulky manual and bulky reference card.

(c) The concept of "substitutable argument" is hard to grasp, and requires help. 

(d) They enjoy the system for the most part. Motivation matters a great deal, however. 

It takes an hour or two for a novice to get through the script on file handling. The total time 
for a reasonably intelligent and motivated novice to proceed from ignorance to a reasonable 
ability to create new files and manipulate old ones seems to be a few days, with perhaps half of 
each day spent on the machine. 

The normal way of proceeding has been to have students in the same room with someone
who knows the UNIX system and the scripts. Thus the student is not brought to a halt by 
difficult questions. The burden on the counselor, however, is much lower than that on a 
teacher of a course. Ideally, the students should be encouraged to proceed with instruction 
immediately prior to their actual use of the computer. They should exercise the scripts on the 
same computer and the same kind of terminal that they will later use for their real work, and 
their first few jobs for the computer should be relatively easy ones. Also, both training and ini- 
tial work should take place on days when the hardware and software are working reliably. 
Rarely is all of this possible, but the closer one comes the better the result. For example, if it 
is known that the hardware is shaky one day, it is better to attempt to reschedule training for 
another one. Students are very frustrated by machine downtime; when nothing is happening, it 
takes some sophistication and experience to distinguish an infinite loop, a slow but functioning 
program, a program waiting for the user, and a broken machine.* 

One disadvantage of training with learn is that students come to depend completely on the 
CAI system, and do not try to read manuals or use other learning aids. This is unfortunate, not 
only because of the increased demands for completeness and accuracy of the scripts, but 
because the scripts do not cover all of the UNIX system. New users should have manuals 
(appropriate for their level) and read them; the scripts ought to be altered to recommend suit- 
able documents and urge students to read them. 

There are several other difficulties which are clearly evident. From the student's 
viewpoint, the most serious is that lessons still crop up which simply can't be passed. Some- 
times this is due to poor explanations, but just as often it is some error in the lesson itself — a 
botched setup, a missing file, an invalid test for correctness, or some system facility that 
doesn't work on the local system in the same way it did on the development system. It takes 
knowledge and a certain healthy arrogance on the part of the user to recognize that the fault is 
not his or hers, but the script writer's. Permitting the student to get on with the next lesson 
regardless does alleviate this somewhat, and the logging facilities make it easy to watch for les- 
sons that no one can pass, but it is still a problem. 

The biggest problem with the previous learn was speed (or lack thereof) — it was often 
excruciatingly slow and a significant drain on the system. The current version so far does not 
seem to have that difficulty, although some scripts, notably eqn, are intrinsically slow. eqn, for
example, must do a lot of work even to print its introductions, let alone check the student
responses, but delay is perceptible in all scripts from time to time.

Another potential problem is that it is possible to break learn inadvertently, by pushing 
interrupt at the wrong time, or by removing critical files, or any number of similar slips. The 
defenses against such problems have steadily been improved, to the point where most students 
should not notice difficulties. Of course, it will always be possible to break learn maliciously, 
but this is not likely to be a problem. 

One area is more fundamental — some commands are sufficiently global in their effect 
that learn currently does not allow them to be executed at all. The most obvious is cd, which
changes to another directory. The prospect of a student who is learning about directories inad- 
vertently moving to some random directory and removing files has deterred us from even writ- 
ing lessons on cd, but ultimately lessons on such topics probably should be added. 

7. Acknowledgments 

We are grateful to all those who have tried learn, for we have benefited greatly from their 
suggestions and criticisms. In particular, M. E. Bittrich, J. L. Blue, S. I. Feldman, P. A. Fox, 
and M. J. McAlpin have provided substantial feedback. Conversations with E. Z. Rothkopf also
provided many of the ideas in the system. We are also indebted to Don Jackowski for serving 



* We have even known an expert programmer to decide the computer was broken when he had simply left 
his terminal in local mode. Novices have great difficulties with such problems. 

ABSTRACT

Lint is a command which examines C source programs, detecting a 
number of bugs and obscurities. It enforces the type rules of C more strictly 
than the C compilers. It may also be used to enforce a number of portability 
restrictions involved in moving programs between different machines and/or 
operating systems. Another option detects a number of wasteful, or error 
prone, constructions which nevertheless are, strictly speaking, legal. 

Lint accepts multiple input files and library specifications, and checks them 
for consistency. 

The separation of function between lint and the C compilers has both his- 
torical and practical rationale. The compilers turn C programs into executable 
files rapidly and efficiently. This is possible in part because the compilers do 
not do sophisticated type checking, especially between separately compiled pro- 
grams. Lint takes a more global, leisurely view of the program, looking much 
more carefully at the compatibilities. 

This document discusses the use of lint, gives an overview of the imple- 
mentation, and gives some hints on the writing of machine independent C 
code. 

Introduction and Usage

Suppose there are two C¹ source files, file1.c and file2.c, which are ordinarily compiled and
loaded together. Then the command

lint file1.c file2.c

produces messages describing inconsistencies and inefficiencies in the programs. The program
enforces the typing rules of C more strictly than the C compilers (for both historical and
practical reasons) enforce them. The command

lint -p file1.c file2.c

will produce, in addition to the above messages, additional messages which relate to the
portability of the programs to other operating systems and machines. Replacing the -p by -h will
produce messages about various error-prone or wasteful constructions which, strictly speaking,
are not bugs. Saying -hp gets the whole works.

The next several sections describe the major messages; the document closes with sections 
discussing the implementation and giving suggestions for writing portable C. An appendix 
gives a summary of the lint options. 

A Word About Philosophy 

Many of the facts which lint needs may be impossible to discover. For example, whether 
a given function in a program ever gets called may depend on the input data. Deciding whether 
exit is ever called is equivalent to solving the famous "halting problem," known to be
recursively undecidable.

Thus, most of the lint algorithms are a compromise. If a function is never mentioned, it 
can never be called. If a function is mentioned, lint assumes it can be called; this is not neces- 
sarily so, but in practice is quite reasonable. 

Lint tries to give information with a high degree of relevance. Messages of the form "xxx
might be a bug" are easy to generate, but are acceptable only in proportion to the fraction of 
real bugs they uncover. If this fraction of real bugs is too small, the messages lose their credi- 
bility and serve merely to clutter up the output, obscuring the more important messages. 

Keeping these issues in mind, we now consider in more detail the classes of messages 
which lint produces. 

Unused Variables and Functions 

As sets of programs evolve and develop, previously used variables and arguments to func- 
tions may become unused; it is not uncommon for external variables, or even entire functions, 
to become unnecessary, and yet not be removed from the source. These "errors of commis- 
sion" rarely cause working programs to fail, but they are a source of inefficiency, and make 
programs harder to understand and change. Moreover, information about such unused vari- 
ables and functions can occasionally serve to discover bugs; if a function does a necessary job, 
and is never called, something is wrong! 

Lint complains about variables and functions which are defined but not otherwise
mentioned. An exception is variables which are declared through explicit extern statements but are
never referenced; thus the statement

extern float sin(); 

will evoke no comment if sin is never used. Note that this agrees with the semantics of the C 
compiler. In some cases, these unused external declarations might be of some interest; they 
can be discovered by adding the -x flag to the lint invocation.

Certain styles of programming require many functions to be written with similar
interfaces; frequently, some of the arguments may be unused in many of the calls. The -v option
is available to suppress the printing of complaints about unused arguments. When -v is in
effect, no messages are produced about unused arguments except for those arguments which 
are unused and also declared as register arguments; this can be considered an active (and 
preventable) waste of the register resources of the machine. 

There is one case where information about unused, or undefined, variables is more dis- 
tracting than helpful. This is when lint is applied to some, but not all, files out of a collection 
which are to be loaded together. In this case, many of the functions and variables defined may 
not be used, and, conversely, many functions and variables defined elsewhere may be used. 
The -u flag may be used to suppress the spurious messages which might otherwise appear.

Set/Used Information 

Lint attempts to detect cases where a variable is used before it is set. This is very difficult 
to do well; many algorithms take a good deal of time and space, and still produce messages 
about perfectly valid programs. Lint detects local variables (automatic and register storage 
classes) whose first use appears physically earlier in the input file than the first assignment to 
the variable. It assumes that taking the address of a variable constitutes a "use," since the 
actual use may occur at any later time, in a data dependent fashion. 

The restriction to the physical appearance of variables in the file makes the algorithm very 
simple and quick to implement, since the true flow of control need not be discovered. It does 
mean that lint can complain about some programs which are legal, but these programs would 
probably be considered bad on stylistic grounds (e.g. might contain at least two goto's). 
Because static and external variables are initialized to 0, no meaningful information can be 
discovered about their uses. The algorithm deals correctly, however, with initialized automatic 
variables, and variables which are used in the expression which first sets them. 
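
A small C illustration of the point about physical order may help. The first function below is legal and well defined, but the first textual mention of total comes before its assignment, so a check based on position in the file, as described above, would be expected to complain. The function names are invented for this sketch, and no claim is made about the exact message lint prints.

```c
/*
 * Legal, but the first textual mention of `total` (at loop:) precedes
 * the assignment (at init:). Note that it takes two goto's to arrange this.
 */
int sum_to_goto(int n)
{
    int total;

    goto init;
loop:
    total += n--;
    if (n > 0)
        goto loop;
    return total;
init:
    total = 0;
    goto loop;
}

/* Same result with the assignment textually first. */
int sum_to(int n)
{
    int total = 0;

    do {
        total += n--;
    } while (n > 0);
    return total;
}
```

Both functions return n + (n-1) + ... + 1 for positive n; only the layout of the source differs.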

The set/used information also permits recognition of those local variables which are set 
and never used; these form a frequent source of inefficiencies, and may also be symptomatic of 
bugs. 

Flow of Control 

Lint attempts to detect unreachable portions of the programs which it processes. It will 
complain about unlabeled statements immediately following goto, break, continue, or return 
statements. An attempt is made to detect loops which can never be left at the bottom, detect- 
ing the special cases while( 1 ) and for(;;) as infinite loops. Lint also complains about loops 
which cannot be entered at the top; some valid programs may have such loops, but at best they 
are bad style, at worst bugs. 
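
As a sketch of the loop case, the following C function (its name is invented here) uses for(;;) and leaves the loop only by return. Control never falls out of the bottom of the loop, so any statement written after the closing brace of the loop would be unreachable.

```c
/*
 * Return the index of the first negative element of v[0..n-1],
 * or -1 if there is none. The for(;;) loop is left only by return.
 */
int first_negative(const int *v, int n)
{
    int i = 0;

    for (;;) {
        if (i >= n)
            return -1;      /* no negative element found */
        if (v[i] < 0)
            return i;       /* index of the first negative element */
        i++;
    }
    /* A statement placed here could never be executed. */
}
```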

Lint has an important area of blindness in the flow of control algorithm: it has no way of 
detecting functions which are called and never return. Thus, a call to exit may cause unreach- 
able code which lint does not detect; the most serious effects of this are in the determination of 
returned function values (see the next section). 

One form of unreachable statement is not usually complained about by lint; a break
statement that cannot be reached causes no message. Programs generated by yacc,² and especially
lex,³ may have literally hundreds of unreachable break statements. The -O flag in the C
compiler will often eliminate the resulting object code inefficiency. Thus, these unreached
statements are of little importance, there is typically nothing the user can do about them, and
the resulting messages would clutter up the lint output. If these messages are desired, lint can
be invoked with the -b option.

Function Values 

Sometimes functions return values which are never used; sometimes programs incorrectly 
use function "values" which have never been returned. Lint addresses this problem in a
number of ways. 

Locally, within a function definition, the appearance of both

return ( expr );

and

return ;

statements is cause for alarm; lint will give the message

function name contains return(e) and return

The most serious difficulty with this is detecting when a function return is implied by flow of 
control reaching the end of the function. This can be seen with a simple example: 

f ( a ) {
	if ( a ) return ( 3 );
	g ();
	}

Notice that, if a tests false, f will call g and then return with no defined return value; this will
trigger a complaint from lint. If g, like exit, never returns, the message will still be produced
when in fact nothing is wrong.
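
One way to avoid the complaint is to make every path return a value. The sketch below is a variant of the example written with explicit types; the stand-in body of g and the choice of 0 as the value for the "a is false" case are assumptions made only for illustration.

```c
/* Stand-in for the g of the text; here it just counts its calls. */
static int g_calls = 0;

static void g(void)
{
    g_calls++;
}

/*
 * Variant of the example in which every path returns a value.
 * Returning 0 when a is false is an arbitrary choice for this sketch.
 */
int f(int a)
{
    if (a)
        return 3;
    g();
    return 0;
}
```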

In practice, some potentially serious bugs have been discovered by this feature; it also 
accounts for a substantial fraction of the "noise" messages produced by lint. 

On a global scale, lint detects cases where a function returns a value, but this value is 
sometimes, or always, unused. When the value is always unused, it may constitute an 
inefficiency in the function definition. When the value is sometimes unused, it may represent 
bad style (e.g., not testing for error conditions). 

The dual problem, using a function value when the function does not return one, is also 
detected. This is a serious problem. Amazingly, this bug has been observed on a couple of 
occasions in "working" programs; the desired function value just happened to have been com- 
puted in the function return register! 

Type Checking 

Lint enforces the type checking rules of C more strictly than the compilers do. The addi- 
tional checking is in four major areas: across certain binary operators and implied assignments, 
at the structure selection operators, between the definition and uses of functions, and in the use 
of enumerations. 

There are a number of operators which have an implied balancing between types of the 
operands. The assignment, conditional ( ? : ), and relational operators have this property; the 
argument of a return statement, and expressions used in initialization also suffer similar 
conversions. In these operations, char, short, int, long, unsigned, float, and double types may 
be freely intermixed. The types of pointers must agree exactly, except that arrays of x's can, of
course, be intermixed with pointers to x's.

The type checking rules also require that, in structure references, the left operand of the 
-> be a pointer to structure, the left operand of the . be a structure, and the right operand of
these operators be a member of the structure implied by the left operand. Similar checking is
done for references to unions.

Strict rules apply to function argument and return value matching. The types float and 
double may be freely matched, as may the types char, short, int, and unsigned. Also, pointers 
can be matched with the associated arrays. Aside from this, all actual arguments must agree in 
type with their declared counterparts. 

With enumerations, checks are made that enumeration variables or members are not
mixed with other types, or other enumerations, and that the only operations applied are =,
initialization, ==, !=, and function arguments and return values.

Type Casts 

The type cast feature in C was introduced largely as an aid to producing more portable 
programs. Consider the assignment 

p = 1 ;

where p is a character pointer. Lint will quite rightly complain. Now, consider the assignment

p = (char *)1 ;

in which a cast has been used to convert the integer to a character pointer. The programmer 
obviously had a strong motivation for doing this, and has clearly signaled his intentions. It 
seems harsh for lint to continue to complain about this. On the other hand, if this code is 
moved to another machine, such code should be looked at carefully. The -c flag controls the
printing of comments about casts. When -c is in effect, casts are treated as though they were
assignments subject to complaint; otherwise, all legal casts are passed without comment, no 
matter how strange the type mixing seems to be. 

Nonportable Character Use 

On the PDP-11, characters are signed quantities, with a range from -128 to 127. On
most of the other C implementations, characters take on only positive values. Thus, lint will 
flag certain comparisons and assignments as being illegal or nonportable. For example, the 
fragment 

char c;

if( (c = getchar()) < 0 ) ....

works on the PDP-11, but will fail on machines where characters always take on positive 
values. The real solution is to declare c an integer, since getchar is actually returning integer 
values. In any case, lint will say "nonportable character comparison".
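
The effect can be reproduced without reading any input. In the C sketch below, SKETCH_EOF stands in for the end-of-file value returned by getchar; the value -1 and all of the function names are assumptions made for illustration. Each function stores that value in a variable of a different type and then tests for "less than zero".

```c
/* Stand-in for the end-of-file value; -1 is an assumption of this sketch. */
#define SKETCH_EOF (-1)

/* Declared as an integer, as the text recommends: the sign survives. */
int eof_seen_int(void)
{
    int c = SKETCH_EOF;
    return c < 0;
}

/* Signed characters, as on the PDP-11: the test still succeeds. */
int eof_seen_signed_char(void)
{
    signed char c = (signed char)SKETCH_EOF;
    return c < 0;
}

/* Characters that hold only positive values: the test can never succeed. */
int eof_seen_unsigned_char(void)
{
    unsigned char c = (unsigned char)SKETCH_EOF;
    return c < 0;
}
```

Only the integer version behaves the same way regardless of how the machine treats characters.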

A similar issue arises with bitfields; when assignments of constant values are made to 
bitfields, the field may be too small to hold the value. This is especially true because on some 
machines bitfields are considered as signed quantities. While it may seem unintuitive to con- 
sider that a two bit field declared of type int cannot hold the value 3, the problem disappears if 
the bitfield is declared to have type unsigned. 
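
A short C sketch of the two-bit case follows. The structure and function names are invented for illustration, and the signed field is exercised only with values that are certain to fit, since the result of storing an out-of-range value in a signed field depends on the implementation.

```c
struct flags_unsigned {
    unsigned int two : 2;   /* holds 0..3 */
};

struct flags_signed {
    signed int two : 2;     /* cannot hold 3; one bit is taken by the sign */
};

/* Store v in an unsigned two-bit field and read it back (reduced modulo 4). */
unsigned int roundtrip_unsigned(unsigned int v)
{
    struct flags_unsigned f;

    f.two = v;
    return f.two;
}

/* Store v in a signed two-bit field and read it back; v should be -1, 0, or 1. */
int roundtrip_signed(int v)
{
    struct flags_signed f;

    f.two = v;
    return f.two;
}
```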

Assignments of longs to ints 

Bugs may arise from the assignment of long to an int, which loses accuracy. This may 
happen in programs which have been incompletely converted to use typedefs. When a typedef 
variable is changed from int to long, the program can stop working because some intermediate 
results may be assigned to ints, losing accuracy. Since there are a number of legitimate reasons 
for assigning longs to ints, the detection of these assignments is enabled by the -a flag.
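
Where such an assignment is intended, one defensive pattern is to check the range before narrowing. The C sketch below assumes a hypothetical typedef count_t that has been widened from int to long; the helper name narrow_checked is likewise invented for this illustration.

```c
#include <limits.h>

typedef long count_t;   /* hypothetical typedef, widened from int to long */

/*
 * Store v in *out and return 1 if it fits in an int;
 * return 0 and leave *out alone otherwise.
 */
int narrow_checked(count_t v, int *out)
{
    if (v < INT_MIN || v > INT_MAX)
        return 0;
    *out = (int)v;
    return 1;
}
```

On machines where int and long have the same size the check always succeeds, which is one reason such bugs can stay hidden until a program is moved.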

Strange Constructions

Several perfectly legal, but somewhat strange, constructions are flagged by lint; the
messages hopefully encourage better code quality, clearer style, and may even point out bugs. The
-h flag is used to enable these checks. For example, in the statement

*p++ ;

the * does nothing; this provokes the message "null effect" from lint. The program fragment

unsigned x ;
if( x < 0 ) ...

is clearly somewhat strange; the test will never succeed. Similarly, the test

if( x > 0 ) ...

is equivalent to

if( x != 0 )

which may not be the intended action. Lint will say "degenerate unsigned comparison" in
these cases. If one says

if( 1 != 0 ) ....

lint will report "constant in conditional context", since the comparison of 1 with 0 gives a con- 
stant result. 
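
The two unsigned comparisons can be checked directly. In the C sketch below (function names invented for illustration), the first function can only ever return 0, and the second confirms that for an unsigned value the tests "greater than zero" and "not equal to zero" always agree.

```c
/* For an unsigned x, x < 0 can never hold, whatever the value of x. */
int is_negative(unsigned int x)
{
    return x < 0;
}

/* For an unsigned x, x > 0 and x != 0 give the same answer. */
int positive_and_nonzero_agree(unsigned int x)
{
    return (x > 0) == (x != 0);
}
```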

Another construction detected by lint involves operator precedence. Bugs which arise 
from misunderstandings about the precedence of operators can be accentuated by spacing and 
formatting, making such bugs extremely hard to find. For example, the statements 

if( x&077 == 0 ) ...

or

x<<2 + 40

probably do not do what was intended. The best solution is to parenthesize such expressions, 
and lint encourages this by an appropriate message. 
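
The C sketch below spells out how such expressions are grouped, next to what was presumably meant. The function names are invented, and the shift example uses the constant 1 rather than 40 so that the shift count stays within the width of an int on any machine.

```c
/* How the compiler reads  x&077 == 0 :  == binds more tightly than &. */
int low_bits_clear_as_parsed(int x)
{
    return x & (077 == 0);      /* 077 == 0 is 0, so the result is always 0 */
}

/* What was presumably intended. */
int low_bits_clear(int x)
{
    return (x & 077) == 0;
}

/* How the compiler reads  x<<2 + 1 :  + binds more tightly than <<. */
int shift_as_parsed(int x)
{
    return x << (2 + 1);
}

/* What was presumably intended. */
int shift_then_add(int x)
{
    return (x << 2) + 1;
}
```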

Finally, when the -h flag is in force lint complains about variables which are redeclared in
inner blocks in a way that conflicts with their use in outer blocks. This is legal, but is
considered by many (including the author) to be bad style, usually unnecessary, and frequently a
bug.

Ancient History 

There are several forms of older syntax which are being officially discouraged. These fall 
into two classes, assignment operators and initialization. 

The older forms of assignment operators (e.g., =+, =-, ...) could cause ambiguous
expressions, such as

a =-1 ;

which could be taken as either

a =- 1 ;

or

a = -1 ;

The situation is especially perplexing if this kind of ambiguity arises as the result of a macro
substitution. The newer, and preferred operators (+=, -=, etc.) have no such ambiguities.
To spur the abandonment of the older forms, lint complains about these old fashioned
operators.

A similar issue arises with initialization. The older language allowed

int x 1 ;

to initialize x to 1. This also caused syntactic difficulties: for example,

int x ( -1 ) ;

looks somewhat like the beginning of a function declaration:

int x ( y ) { . . .

and the compiler must read a fair ways past x in order to be sure what the declaration really is.
Again, the problem is even more perplexing when the initializer involves a macro. The current
syntax places an equals sign between the variable and the initializer:

int x = -1 ;

This is free of any possible syntactic ambiguity. 

Pointer Alignment 

Certain pointer assignments may be reasonable on some machines, and illegal on others, 
due entirely to alignment restrictions. For example, on the PDP-11, it is reasonable to assign 
integer pointers to double pointers, since double precision values may begin on any integer 
boundary. On the Honeywell 6000, double precision values must begin on even word boun- 
daries; thus, not all such assignments make sense. Lint tries to detect cases where pointers are 
assigned to other pointers, and such alignment problems might arise. The message "possible
pointer alignment problem" results from this situation whenever either the -p or -h flags are
in effect.
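
When a double really must be read from or written to a location whose alignment is not known, one portable alternative to converting the pointer is to copy the bytes. The C sketch below shows the idea; the function names are invented for illustration, and this is a general technique rather than anything lint itself does.

```c
#include <string.h>

/*
 * Read a double stored at an arbitrary byte address. Copying the bytes
 * avoids forming a double pointer that might be misaligned.
 */
double load_double(const unsigned char *bytes)
{
    double d;

    memcpy(&d, bytes, sizeof d);
    return d;
}

/* Store a double at an arbitrary byte address, again by copying. */
void store_double(unsigned char *bytes, double d)
{
    memcpy(bytes, &d, sizeof d);
}
```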

Multiple Uses and Side Effects 

In complicated expressions, the best order in which to evaluate subexpressions may be 
highly machine dependent. For example, on machines (like the PDP-11) in which the stack 
runs backwards, function arguments will probably be best evaluated from right-to-left; on 
machines with a stack running forward, left-to-right seems most attractive. Function calls 
embedded as arguments of other functions may or may not be treated similarly to ordinary 
arguments. Similar issues arise with other operators which have side effects, such as the assign- 
ment operators and the increment and decrement operators. 

In order that the efficiency of C on a particular machine not be unduly compromised, the 
C language leaves the order of evaluation of complicated expressions up to the local compiler, 
and, in fact, the various C compilers have considerable differences in the order in which they 
will evaluate complicated expressions. In particular, if any variable is changed by a side effect, 
and also used elsewhere in the same expression, the result is explicitly undefined. 

Lint checks for the important special case where a simple scalar variable is affected. For 
example, the statement 

a[i] = b[i++] ;

will draw the complaint:

warning: i evaluation order undefined
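
The usual remedy is to separate the use of the variable from the side effect. The C sketch below (the function name is invented) copies an array with the subscript used and incremented in different statements, so the result cannot depend on the order in which a compiler evaluates the parts of one expression.

```c
/*
 * Copy n elements from b to a. The subscript i is used in one statement
 * and incremented in the next, so no expression both changes i and
 * uses it elsewhere.
 */
void copy_ints(int *a, const int *b, int n)
{
    int i = 0;

    while (i < n) {
        a[i] = b[i];
        i++;
    }
}
```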

Implementation 

Lint consists of two programs and a driver. The first program is a version of the Portable 
C Compiler⁴,⁵ which is the basis of the IBM 370, Honeywell 6000, and Interdata 8/32 C
compilers. This compiler does lexical and syntax analysis on the input text, constructs and
maintains symbol tables, and builds trees for expressions. Instead of writing an intermediate file
which is passed to a code generator, as the other compilers do, lint produces an intermediate file
which consists of lines of ASCII text. Each line contains an external variable name, an encoding
of the context in which it was seen (use, definition, declaration, etc.), a type specifier, and a 
source file name and line number. The information about variables local to a function or file is 
collected by accessing the symbol table, and examining the expression trees. 

Comments about local problems are produced as detected. The information about exter- 
nal names is collected onto an intermediate file. After all the source files and library descrip- 
tions have been collected, the intermediate file is sorted to bring all information collected about 
a given external name together. The second, rather small, program then reads the lines from 
the intermediate file and compares all of the definitions, declarations, and uses for consistency. 
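
The sort-then-compare idea behind the second pass can be sketched in C as follows. This is not lint's actual code or file format: the record layout, the field names, the representation of a type as a text string, and the rule of comparing each record against the first one for the same name are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* One record of a hypothetical intermediate file. */
struct record {
    const char *name;   /* external name */
    const char *type;   /* type specifier, kept as text in this sketch */
    const char *file;   /* source file name */
    int line;           /* source line number */
};

/* Order by name, then by file and line so the result is deterministic. */
static int by_name(const void *a, const void *b)
{
    const struct record *ra = a;
    const struct record *rb = b;
    int c = strcmp(ra->name, rb->name);

    if (c == 0)
        c = strcmp(ra->file, rb->file);
    if (c == 0)
        c = (ra->line > rb->line) - (ra->line < rb->line);
    return c;
}

/*
 * Sort the records so that all mentions of a name are adjacent, then
 * count the records whose type differs from the first record seen
 * for the same name.
 */
int count_type_conflicts(struct record *r, int n)
{
    int i, first = 0, conflicts = 0;

    qsort(r, (size_t)n, sizeof r[0], by_name);
    for (i = 1; i < n; i++) {
        if (strcmp(r[i].name, r[first].name) != 0)
            first = i;                      /* a new name starts here */
        else if (strcmp(r[i].type, r[first].type) != 0)
            conflicts++;
    }
    return conflicts;
}
```

Sorting first means the comparison needs only a single pass over adjacent records, which fits the description of the second program as rather small.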

The driver controls this process, and is also responsible for making the options available 
to both passes of lint. 

Portability 

C on the Honeywell and IBM systems is used, in part, to write system code for the host 
operating system. This means that the implementation of C tends to follow local conventions 
rather than adhere strictly to UNIX system conventions. Despite these differences, many C
programs have been successfully moved to GCOS and the various IBM installations with little 
effort. This section describes some of the differences between the implementations, and 
discusses the lint features which encourage portability. 

Uninitialized external variables are treated differently in different implementations of C. 
Suppose two files both contain a declaration without initialization, such as 

int a ; 

outside of any function. The UNIX loader will resolve these declarations, and cause only a sin- 
gle word of storage to be set aside for a. Under the GCOS and IBM implementations, this is 
not feasible (for various stupid reasons!) so each such declaration causes a word of storage to 
be set aside and called a. When loading or library editing takes place, this causes fatal conflicts 
which prevent the proper operation of the program. If lint is invoked with the -p flag, it will
detect such multiple definitions. 

A related difficulty comes from the amount of information retained about external names 
during the loading process. On the UNIX system, externally known names have seven 
significant characters, with the upper/lower case distinction kept. On the IBM systems, there 
are eight significant characters, but the case distinction is lost. On GCOS, there are only six 
characters, of a single case. This leads to situations where programs run on the UNIX system, 
but encounter loader problems on the IBM or GCOS systems. Lint -p causes all external
symbols to be mapped to one case and truncated to six characters, providing a worst-case analysis.
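
The worst-case mapping is easy to reproduce by hand when checking a list of names. The sketch below assumes lower case as the single case and uses hypothetical helper names (worst_case_name, names_collide); it illustrates the idea and is not lint's own code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define WORST_CASE_LEN 6   /* GCOS keeps six characters of one case */

/* Map name to one case and truncate it to six characters.
 * out must have room for WORST_CASE_LEN + 1 bytes. */
void worst_case_name(const char *name, char *out)
{
    size_t i;

    assert(name != NULL && out != NULL);
    for (i = 0; i < WORST_CASE_LEN && name[i] != '\0'; i++)
        out[i] = (char)tolower((unsigned char)name[i]);
    out[i] = '\0';
}

/* Return 1 if two different external names become the same name
 * after the mapping, 0 otherwise. */
int names_collide(const char *a, const char *b)
{
    char ma[WORST_CASE_LEN + 1], mb[WORST_CASE_LEN + 1];

    if (strcmp(a, b) == 0)
        return 0;          /* the same name is not a collision */
    worst_case_name(a, ma);
    worst_case_name(b, mb);
    return strcmp(ma, mb) == 0;
}
```

Under this mapping, names such as counter1 and counter2, or Total and total, are indistinguishable to the loader.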

A number of differences arise in the area of character handling: characters in the UNIX
system are eight bit ASCII, while they are eight bit EBCDIC on the IBM, and nine bit ASCII on
GCOS. Moreover, character strings go from high to low bit positions ("left to right") on
GCOS and IBM, and low to high ("right to left") on the PDP-11. This means that code
attempting to construct strings out of character constants, or attempting to use characters as 
indices into arrays, must be looked at with great suspicion. Lint is of little help here, except to 
flag multi-character character constants. 

Of course, the word sizes are different! This causes less trouble than might be expected, 
at least when moving from the UNIX system (16 bit words) to the IBM (32 bits) or GCOS (36 
bits). The main problems are likely to arise in shifting or masking. C now supports a bit-field 
facility, which can be used to write much of this code in a reasonably portable way. Frequently, 
portability of such code can be enhanced by slight rearrangements in coding style. Many of the 
incompatibilities seem to have the flavor of writing 

x &= 0177700 ;

to clear the low order six bits of x. This suffices on the PDP-11, but fails badly on GCOS and
IBM. If the bit field feature cannot be used, the same effect can be obtained by writing

x &= ~077 ;

which will work on all these machines. 
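
The difference between the two masks can be seen by applying both to a value wider than sixteen bits. The functions below are a small sketch; their names are made up for this illustration, and the 32-bit value in the usage note assumes a machine whose unsigned int has at least 32 bits.

```c
#include <assert.h>

/* Clear the low-order six bits of x on any word size: ~077 has every
 * bit set except the low six, however wide the word is. */
unsigned clear_low6(unsigned x)
{
    return x & ~077u;
}

/* The non-portable form, for comparison: the constant covers only
 * sixteen bits, so any higher bits of x are cleared as well. */
unsigned clear_low6_pdp11_only(unsigned x)
{
    return x & 0177700u;
}

/* The two forms agree for values that fit in sixteen bits. */
void check_agreement(unsigned x)
{
    unsigned low16 = x & 0177777u;

    assert(clear_low6(low16) == clear_low6_pdp11_only(low16));
}
```

With a 32-bit unsigned int, clear_low6(0x12345678) keeps the upper half of the word and yields 0x12345640, while the sixteen-bit constant leaves only 0x5640.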

The right shift operator is arithmetic shift on the PDP-11, and logical shift on most other 
machines. To obtain a logical shift on all machines, the left operand can be typed unsigned.
Characters are considered signed integers on the PDP-11, and unsigned on the other machines. 
This persistence of the sign bit may be reasonably considered a bug in the PDP-11 hardware 
which has infiltrated itself into the C language. If there were a good way to discover the pro- 
grams which would be affected, C could be changed; in any case, lint is no help here. 
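
Both points can be handled defensively in the code itself. The sketch below uses hypothetical function names; the mask 0377 assumes that only the low eight bits of a character are of interest, which would not cover the nine bit characters on GCOS.

```c
#include <assert.h>
#include <limits.h>

/* Logical right shift: with the left operand typed unsigned, zeros
 * are shifted in on every machine. */
unsigned logical_shift_right(int value, unsigned count)
{
    assert(count < sizeof(unsigned) * CHAR_BIT);
    return (unsigned)value >> count;
}

/* Fetch a character as a non-negative integer whether or not the
 * machine extends the sign bit; 0377 keeps the low eight bits. */
int char_to_index(char c)
{
    return c & 0377;
}
```

The result of char_to_index is always in the range 0 to 0377, so it can be used as an array index without regard to the signedness of characters.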

The above discussion may have made the problem of portability seem bigger than it in 
fact is. The issues involved here are rarely subtle or mysterious, at least to the implementor of
the program, although they can involve some work to straighten out. The most serious bar to 
the portability of UNIX system utilities has been the inability to mimic essential UNIX system 
functions on the other systems. The inability to seek to a random character position in a text 
file, or to establish a pipe between processes, has involved far more rewriting and debugging 
than any of the differences in C compilers. On the other hand, lint has been very helpful in 
moving the UNIX operating system and associated utility programs to other machines. 

Shutting Lint Up 

There are occasions when the programmer is smarter than lint. There may be valid reasons
for "illegal" type casts, functions with a variable number of arguments, etc. Moreover, as
specified above, the flow of control information produced by lint often has blind spots, causing 
occasional spurious messages about perfectly reasonable programs. Thus, some way of com- 
municating with lint, typically to shut it up, is desirable. 

The form which this mechanism should take is not at all clear. New keywords would 
require current and old compilers to recognize these keywords, if only to ignore them. This has 
both philosophical and practical problems. New preprocessor syntax suffers from similar prob- 
lems. 

What was finally done was to cause a number of words to be recognized by lint when they 
were embedded in comments. This required minimal preprocessor changes; the preprocessor 
just had to agree to pass comments through to its output, instead of deleting them as had been 
previously done. Thus, lint directives are invisible to the compilers, and the effect on systems 
with the older preprocessors is merely that the lint directives don't work. 

The first directive is concerned with flow of control information; if a particular place in 
the program cannot be reached, but this is not apparent to lint, this can be asserted by the 
directive 

/* NOTREACHED */

at the appropriate spot in the program. Similarly, if it is desired to turn off strict type checking 
for the next expression, the directive 

/* NOSTRICT */

can be used; the situation reverts to the previous default after the next expression. The -v
flag can be turned on for one function by the directive

/* ARGSUSED */

Complaints about variable number of arguments in calls to a function can be turned off by the 
directive 

/* VARARGS */

preceding the function definition. In some cases, it is desirable to check the first several
arguments, and leave the later arguments unchecked. This can be done by following the
VARARGS keyword immediately with a digit giving the number of arguments which should be
checked; thus,

/* VARARGS2 */

will cause the first two arguments to be checked, the others unchecked. Finally, the directive

/* LINTLIBRARY */

at the head of a file identifies this file as a library declaration file; this topic is worth a section by 
itself. 
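
To show where the directives sit in practice, here is a short sketch. The functions sum_ints, always_zero, and checked_divide are hypothetical, and the code uses the later stdarg.h style of variable argument lists rather than the style of the compilers described here; the placement of the comments is the point of the example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

/* VARARGS1 */
/* Sum count integers; only the first argument is to be checked. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);
    while (count-- > 0)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* ARGSUSED */
/* The argument is unused; it is kept to match a calling convention. */
int always_zero(int unused)
{
    return 0;
}

int checked_divide(int num, int den)
{
    if (den == 0) {
        abort();
        /* NOTREACHED */
    }
    return num / den;
}
```

Because the directives are ordinary comments, the compiler sees exactly the same program with or without them.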

Library Declaration Files 

Lint accepts certain library directives, such as 

-ly 

and tests the source files for compatibility with these libraries. This is done by accessing library 
description files whose names are constructed from the library directives. These files all begin 
with the directive 

/* LINTLIBRARY */

which is followed by a series of dummy function definitions. The critical parts of these 
definitions are the declaration of the function return type, whether the dummy function returns 
a value, and the number and types of arguments to the function. The VARARGS and 
ARGSUSED directives can be used to specify features of the library functions. 

Lint library files are processed almost exactly like ordinary source files. The only 
difference is that functions which are defined on a library file, but are not used on a source file, 
draw no complaints. Lint does not simulate a full library search algorithm, and complains if the 
source files contain a redefinition of a library routine (this is a feature!). 

By default, lint checks the programs it is given against a standard library file, which contains
descriptions of the programs which are normally loaded when a C program is run. When
the -p flag is in effect, another file is checked containing descriptions of the standard I/O library
routines which are expected to be portable across various machines. The -n flag can be used to
suppress all library checking.

Bugs, etc. 

Lint was a difficult program to write, partially because it is closely connected with matters 
of programming style, and partially because users usually don't notice bugs which cause lint to 
miss errors which it should have caught. (By contrast, if lint incorrectly complains about some- 
thing that is correct, the programmer reports that immediately!) 

A number of areas remain to be further developed. The checking of structures and arrays 
is rather inadequate; size incompatibilities go unchecked, and no attempt is made to match up 
structure and union declarations across files. Some stricter checking of the use of the typedef is 
clearly desirable, but what checking is appropriate, and how to carry it out, is still to be deter- 
mined. 

Lint shares the preprocessor with the C compiler. At some point it may be appropriate for 
a special version of the preprocessor to be constructed which checks for things such as unused 
macro definitions, macro arguments which have side effects which are not expanded at all, or 
are expanded more than once, etc. 

The central problem with lint is the packaging of the information which it collects. There 
are many options which serve only to turn off, or slightly modify, certain features. There are 
pressures to add even more of these options.

In conclusion, it appears that the general notion of having two programs is a good one. 
The compiler concentrates on quickly and accurately turning the program text into bits which 
can be run; lint concentrates on issues of portability, style, and efficiency. Lint can afford to be 
wrong, since incorrectness and over-conservatism are merely annoying, not fatal. The compiler 
can be fast since it knows that lint will cover its flanks. Finally, the programmer can concen- 
trate at one stage of the programming process solely on the algorithms, data structures, and 
correctness of the program, and then later retrofit, with the aid of lint, the desirable properties 
of universality and portability. 

Appendix: Current Lint Options

The command currently has the form

lint [-options ] files... library-descriptors...

The options are 

h   Perform heuristic checks
p   Perform portability checks
v   Don't report unused arguments
u   Don't report unused or undefined externals
b   Report unreachable break statements.
x   Report unused external declarations
a   Report assignments of long to int or shorter.
c   Complain about questionable casts
n   No library checking is done
s   Same as h (for historical reasons)

ABSTRACT

In a programming project, it is easy to lose track of which files need to be 
reprocessed or recompiled after a change is made in some part of the source. 
Make provides a simple mechanism for maintaining up-to-date versions of programs
that result from many operations on a number of files. It is possible to
tell Make the sequence of commands that create certain files, and the list of
files that require other files to be current before the operations can be done.
Whenever a change is made in any part of the program, the Make command
will create the proper files simply, correctly, and with a minimum amount of
effort.

The basic operation of Make is to find the name of a needed target in the
description, ensure that all of the files on which it depends exist and are up to 
date, and then create the target if it has not been modified since its generators 
were. The description file really defines the graph of dependencies; Make does 
a depth-first search of this graph to determine what work is really necessary. 

Make also provides a simple macro substitution facility and the ability to 
encapsulate commands in a single file for convenient administration. 

Introduction

It is common practice to divide large programs into smaller, more manageable pieces. 
The pieces may require quite different treatments: some may need to be run through a macro 
processor, some may need to be processed by a sophisticated program generator (e.g., Yacc[1]
or Lex[2]). The outputs of these generators may then have to be compiled with special options
and with certain definitions and declarations. The code resulting from these transformations 
may then need to be loaded together with certain libraries under the control of special options. 
Related maintenance activities involve running complicated test scripts and installing validated 
modules. Unfortunately, it is very easy for a programmer to forget which files depend on 
which others, which files have been modified recently, and the exact sequence of operations 
needed to make or exercise a new version of the program. After a long editing session, one 
may easily lose track of which files have been changed and which object modules are still valid, 
since a change to a declaration can obsolete a dozen other files. Forgetting to compile a routine 
that has been changed or that uses changed declarations will result in a program that will not 
work, and a bug that can be very hard to track down. On the other hand, recompiling every- 
thing in sight just to be safe is very wasteful. 

The program described in this report mechanizes many of the activities of program 
development and maintenance. If the information on inter-file dependences and command 
sequences is stored in a file, the simple command 

make 

is frequently sufficient to update the interesting files, regardless of the number that have been 
edited since the last "make". In most cases, the description file is easy to write and changes 
infrequently. It is usually easier to type the make command than to issue even one of the 
needed operations, so the typical cycle of program development operations becomes 

think — edit — make — test . . . 

Make is most useful for medium-sized programming projects; it does not solve the prob- 
lems of maintaining multiple source versions or of describing huge programs. Make was 
designed for use on Unix, but a version runs on GCOS. 

Basic Features 

The basic operation of make is to update a target file by ensuring that all of the files on 
which it depends exist and are up to date, then creating the target if it has not been modified 
since its dependents were. Make does a depth-first search of the graph of dependences. The 
operation of the command depends on the ability to find the date and time that a file was last 
modified. 
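
The update rule can be sketched in a few lines. This is only an illustration under stated assumptions: the struct node, the in-memory modification times, and the function update are hypothetical stand-ins for the file system and command execution, and no commands are actually run.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 8

struct node {
    const char *name;
    long mtime;                    /* 0 means the file does not exist */
    struct node *deps[MAX_DEPS];   /* files this target depends on */
    size_t ndeps;
    int rebuilt;                   /* set when the commands would run */
};

/*
 * Bring target up to date, visiting its dependencies first
 * (depth-first). The argument now stands in for the clock: a target
 * that is rebuilt receives that time. Returns the target's
 * modification time after the update.
 */
long update(struct node *target, long now)
{
    long newest = 0;
    size_t i;

    assert(target != NULL);
    for (i = 0; i < target->ndeps; i++) {
        long t = update(target->deps[i], now);
        if (t > newest)
            newest = t;
    }
    /* Rebuild if missing, or older than something it depends on.
     * A node without dependencies is a source file: nothing to run. */
    if (target->ndeps > 0 &&
        (target->mtime == 0 || target->mtime < newest)) {
        target->rebuilt = 1;
        target->mtime = now;
    }
    return target->mtime;
}
```

A file that several targets depend on is visited more than once, but after the first visit it is already current, so the repeated visits do no further work.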

To illustrate, let us consider a simple example: A program named prog is made by compiling
and loading three C-language files x.c, y.c, and z.c with the lS library. By convention, the
output of the C compilations will be found in files named x.o, y.o, and z.o. Assume that the
files x.c and y.c share some declarations in a file named defs, but that z.c does not. That is, x.c
and y.c have the line

#include "defs"

The following text describes the relationships and operations:

prog : x.o y.o z.o
	cc x.o y.o z.o -lS -o prog
x.o y.o : defs

If this information were stored in a file named makefile, the command

make

would perform the operations needed to recreate prog after any changes had been made to any
of the four source files x.c, y.c, z.c, or defs.

Make operates using three sources of information: a user-supplied description file (as 
above), file names and "last-modified" times from the file system, and built-in rules to bridge
some of the gaps. In our example, the first line says that prog depends on three ".o" files. 
Once these object files are current, the second line describes how to load them to create prog. 
The third line says that x.o and y.o depend on the file defs. From the file system, make discov- 
ers that there are three ".c" files corresponding to the needed ".o" files, and uses built-in 
information on how to generate an object from a source file (i.e., issue a "cc -c" command).

The following long-winded description file is equivalent to the one above, but takes no 
advantage of make's innate knowledge: 

prog : x.o y.o z.o
	cc x.o y.o z.o -lS -o prog
x.o : x.c defs
	cc -c x.c
y.o : y.c defs
	cc -c y.c
z.o : z.c
	cc -c z.c

If none of the source or object files had changed since the last time prog was made, all of 
the files would be current, and the command 

make 

would just announce this fact and stop. If, however, the defs file had been edited, x.c and y.c 
(but not z.c) would be recompiled, and then prog would be created from the new ".o" files. If
only the file y.c had changed, only it would be recompiled, but it would still be necessary to 
reload prog. 

If no target name is given on the make command line, the first target mentioned in the 
description is created; otherwise the specified targets are made. The command 

make x.o 

would recompile x.o if x.c or defs had changed. 

If the file exists after the commands are executed, its time of last modification is used in 
further decisions; otherwise the current time is used. It is often quite useful to include rules 
with mnemonic names and commands that do not actually produce a file with that name. 
These entries can take advantage of make's ability to generate files and substitute macros. 
Thus, an entry "save" might be included to copy a certain set of files, or an entry "cleanup" 
might be used to throw away unneeded intermediate files. In other cases one may maintain a
zero-length file purely to keep track of the time at which certain actions were performed. This 
technique is useful for maintaining remote archives and listings. 

Make has a simple macro mechanism for substituting in dependency lines and command
strings. Macros are defined by command arguments or description file lines with embedded 
equal signs. A macro is invoked by preceding the name by a dollar sign: macro names longer 
than one character must be parenthesized. The name of the macro is either the single character 
after the dollar sign or a name inside parentheses. The following are valid macro invocations: 

$(CFLAGS)
$2
$(xy)
$Z
$(Z)

The last two invocations are identical. $$ is a dollar sign. All of these macros are assigned
values during input, as shown below. Four special macros change values during the execution
of the command: $*, $@, $?, and $<. They will be discussed later. The following fragment
shows the use:

OBJECTS = x.o y.o z.o
LIBES = -lS
prog: $(OBJECTS)
	cc $(OBJECTS) $(LIBES) -o prog

The command

make

loads the three object files with the lS library. The command

make "LIBES= -ll -lS"

loads them with both the Lex ("-ll") and the Standard ("-lS") libraries, since macro
definitions on the command line override definitions in the description. (It is necessary to
quote arguments with embedded blanks in UNIX commands.)
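
The substitution itself is simple enough to sketch. The code below is an illustration only: struct macro, lookup, and expand are hypothetical names, the macro table is a plain array, and the four special macros are not handled. It is not the code used by make.

```c
#include <assert.h>
#include <string.h>

struct macro { const char *name; const char *value; };

/* Find a macro by name; a macro that was never defined has the
 * null string as its value. */
static const char *lookup(const struct macro *tab, size_t n,
                          const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strlen(tab[i].name) == len &&
            strncmp(tab[i].name, name, len) == 0)
            return tab[i].value;
    return "";
}

/*
 * Expand $(NAME), $X, and $$ in src into dst, which holds size bytes
 * and is always terminated. Returns 0 on success, or -1 if dst is
 * too small or a parenthesis is not closed.
 */
int expand(const char *src, const struct macro *tab, size_t n,
           char *dst, size_t size)
{
    size_t used = 0;

    assert(size > 0);
    dst[0] = '\0';
    while (*src != '\0') {
        const char *piece = src;   /* default: copy one character */
        size_t len = 1;

        if (src[0] == '$' && src[1] == '$') {
            src += 2;              /* $$ is a dollar sign */
        } else if (src[0] == '$' && src[1] == '(') {
            const char *close = strchr(src + 2, ')');
            if (close == NULL)
                return -1;
            piece = lookup(tab, n, src + 2, (size_t)(close - (src + 2)));
            len = strlen(piece);
            src = close + 1;
        } else if (src[0] == '$' && src[1] != '\0') {
            piece = lookup(tab, n, src + 1, 1);   /* one-character name */
            len = strlen(piece);
            src += 2;
        } else {
            src += 1;              /* ordinary character */
        }
        if (used + len >= size)
            return -1;
        memcpy(dst + used, piece, len);
        used += len;
        dst[used] = '\0';
    }
    return 0;
}
```

Given OBJECTS and LIBES defined as in the fragment above, expanding the command string "cc $(OBJECTS) $(LIBES) -o prog" produces "cc x.o y.o z.o -lS -o prog".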

The following sections detail the form of description files and the command line, and dis- 
cuss options and built-in rules in more detail. 

Description Files and Substitutions 

A description file contains three types of information: macro definitions, dependency 
information, and executable commands. There is also a comment convention: all characters 
after a sharp (#) are ignored, as is the sharp itself. Blank lines and lines beginning with a sharp
are totally ignored. If a non-comment line is too long, it can be continued using a backslash. If 
the last character of a line is a backslash, the backslash, newline, and following blanks and tabs 
are replaced by a single blank. 
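
The continuation rule can be stated compactly in code. The function join_continuations below is a hypothetical sketch of that one rule; it does not handle comments or any other part of description file parsing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src to dst, replacing each backslash-newline pair, together
 * with any blanks and tabs that follow it, by a single blank.
 * dst must be at least as large as src, since the text never grows.
 */
void join_continuations(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;
            while (*src == ' ' || *src == '\t')
                src++;
            *dst++ = ' ';
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

A backslash that is not the last character of its line is copied unchanged.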

A macro definition is a line containing an equal sign not preceded by a colon or a tab. 
The name (string of letters and digits) to the left of the equal sign (trailing blanks and tabs are 
stripped) is assigned the string of characters following the equal sign (leading blanks and tabs 
are stripped). The following are valid macro definitions:

2 = xyz
abc = -ll -ly -lS
LIBES =

The last definition assigns LIBES the null string. A macro that is never explicitly defined has 
the null string as value. Macro definitions may also appear on the make command line (see 
below). 

Other lines give information about target files. The general form of an entry is: 

target1 [target2 . . .] :[:] [dependent1 . . .] [; commands] [# . . .]
[(tab) commands] [# . . .]

Items inside brackets may be omitted. Targets and dependents are strings of letters, digits,
periods, and slashes. (Shell metacharacters "*" and "?" are expanded.) A command is any
string of characters not including a sharp (except in quotes) or newline. Commands may 
appear either after a semicolon on a dependency line or on lines beginning with a tab immedi- 
ately following a dependency line. 

A dependency line may have either a single or a double colon. A target name may appear 
on more than one dependency line, but all of those lines must be of the same (single or double 
colon) type. 

1. For the usual single-colon case, at most one of these dependency lines may have a com- 
mand sequence associated with it. If the target is out of date with any of the dependents 
on any of the lines, and a command sequence is specified (even a null one following a 
semicolon or tab), it is executed; otherwise a default creation rule may be invoked.

2. In the double-colon case, a command sequence may be associated with each dependency 
line; if the target is out of date with any of the files on a particular line, the associated 
commands are executed. A built-in rule may also be executed. This detailed form is of 
particular value in updating archive-type files. 

If a target must be created, the sequence of commands is executed. Normally, each com- 
mand line is printed and then passed to a separate invocation of the Shell after substituting for 
macros. (The printing is suppressed in silent mode or if the command line begins with an @
sign.) Make normally stops if any command signals an error by returning a non-zero error
code. (Errors are ignored if the "-i" flag has been specified on the make command line, if
the fake target name ".IGNORE" appears in the description file, or if the command string in
the description file begins with a hyphen. Some UNIX commands return meaningless status.)
Because each command line is passed to a separate invocation of the Shell, care must be taken
with certain commands (e.g., cd and Shell control commands) that have meaning only within a
single Shell process: the results are forgotten before the next line is executed.

Before issuing any command, certain macros are set. $@ is set to the name of the file to
be "made". $? is set to the string of names that were found to be younger than the target. If
the command was generated by an implicit rule (see below), $< is the name of the related file
that caused the action, and $* is the prefix shared by the current and the dependent file names.

If a file must be made but there are no explicit commands or relevant built-in rules, the 
commands associated with the name ".DEFAULT" are used. If there is no such name, make 
prints a message and stops. 

Command Usage 

The make command takes four kinds of arguments: macro definitions, flags, description 
file names, and target file names. 

make [ flags ] [ macro definitions ] [ targets ] 

The following summary of the operation of the command explains how these arguments are
interpreted.

First, all macro definition arguments (arguments with embedded equal signs) are analyzed 
and the assignments made. Command-line macros override corresponding definitions found in 
the description files. 

Next, the flag arguments are examined. The permissible flags are 

-i  Ignore error codes returned by invoked commands. This mode is entered if the fake
    target name ".IGNORE" appears in the description file.

-s  Silent mode. Do not print command lines before executing. This mode is also entered if
    the fake target name ".SILENT" appears in the description file.

-r  Do not use the built-in rules.

-n  No execute mode. Print commands, but do not execute them. Even lines beginning with
    an "@" sign are printed.

-t  Touch the target files (causing them to be up to date) rather than issue the usual
    commands.

-q  Question. The make command returns a zero or non-zero status code depending on
    whether the target file is or is not up to date.

-p  Print out the complete set of macro definitions and target descriptions.

-d  Debug mode. Print out detailed information on files and times examined.

-f  Description file name. The next argument is assumed to be the name of a description
    file. A file name of "-" denotes the standard input. If there are no "-f" arguments,
    the file named makefile or Makefile in the current directory is read. The contents of the
    description files override the built-in rules if they are present.

Finally, the remaining arguments are assumed to be the names of targets to be made; they 
are done in left to right order. If there are no such arguments, the first name in the description 
files that does not begin with a period is "made". 

Implicit Rules 

The make program uses a table of interesting suffixes and a set of transformation rules to 
supply default dependency information and implied commands. (The Appendix describes these 
tables and means of overriding them.) The default suffix list is:

.o    Object file
.c    C source file
.e    Efl source file
.r    Ratfor source file
.f    Fortran source file
.s    Assembler source file
.y    Yacc-C source grammar
.yr   Yacc-Ratfor source grammar
.ye   Yacc-Efl source grammar
.l    Lex source grammar

The following diagram summarizes the default transformation paths. If there are two paths 
connecting a pair of suffixes, the longer one is used only if the intermediate file exists or is 
named in the description.

                        .o

    .c    .r    .e    .f    .s    .y    .yr    .ye    .l    .d

   .y .l  .yr   .ye

If the file x.o were needed and there were an x.c in the description or directory, it would
be compiled. If there were also an x.l, that grammar would be run through Lex before compiling
the result. However, if there were no x.c but there were an x.l, make would discard the
intermediate C-language file and use the direct link in the graph above.

It is possible to change the names of some of the compilers used in the default, or the flag
arguments with which they are invoked by knowing the macro names used. The compiler
names are the macros AS, CC, RC, EC, YACC, YACCR, YACCE, and LEX. The command

make CC=newcc

will cause the "newcc" command to be used instead of the usual C compiler. The macros
CFLAGS, RFLAGS, EFLAGS, YFLAGS, and LFLAGS may be set to cause these commands
to be issued with optional flags. Thus,

make "CFLAGS= -O"

causes the optimizing C compiler to be used.

Example 

As an example of the use of make, we will present the description file used to maintain 
the make command itself. The code for make is spread over a number of C source files and a 
Yacc grammar. The description file contains: 

# Description file for the Make command

P = und -3 | opr -r2	# send to GCOS to be printed
FILES = Makefile version.c defs main.c doname.c misc.c files.c dosys.c gram.y lex.c gcos.c
OBJECTS = version.o main.o doname.o misc.o files.o dosys.o gram.o
LIBES= -lS
LINT = lint -p
CFLAGS = -O

make: $(OBJECTS)
	cc $(CFLAGS) $(OBJECTS) $(LIBES) -o make
	size make

$(OBJECTS): defs
gram.o: lex.c

cleanup:
	-rm *.o gram.c
	-du

install:
	@size make /usr/bin/make
	cp make /usr/bin/make ; rm make

print: $(FILES)	# print recently changed files
	pr $? | $P
	touch print

test:
	make -dp | grep -v TIME >1zap
	/usr/bin/make -dp | grep -v TIME >2zap
	diff 1zap 2zap
	rm 1zap 2zap

lint : dosys.c doname.c files.c main.c misc.c version.c gram.c
	$(LINT) dosys.c doname.c files.c main.c misc.c version.c gram.c
	rm gram.c

arch:
	ar uv /sys/source/s2/make.a $(FILES)

Make usually prints out each command before issuing it. The following output results from 
typing the simple command 

make 

in a directory containing only the source and description file: 

cc -c version.c
cc -c main.c
cc -c doname.c
cc -c misc.c
cc -c files.c
cc -c dosys.c
yacc gram.y
mv y.tab.c gram.c
cc -c gram.c
cc version.o main.o doname.o misc.o files.o dosys.o gram.o -lS -o make
13188+3348+3044 = 19580b = 046174b

Although none of the source files or grammars were mentioned by name in the description file, 
make found them using its suffix rules and issued the needed commands. The string of digits
results from the "size make" command; the printing of the command line itself was suppressed
by an @ sign. The @ sign on the size command in the description file suppressed the printing
of the command, so only the sizes are written.

The last few entries in the description file are useful maintenance sequences. The "print" 
entry prints only the files that have been changed since the last "make print" command. A 
zero-length file print is maintained to keep track of the time of the printing; the S? macro in the 
command line then picks up only the names of the files changed since print was touched. The 
printed output can be sent to a different printer or to a file by changing the definition of the P 
macro: 

make print "P = opr -sp"

or

make print "P= cat >zap"

Suggestions and Warnings 

The most common difficulties arise from make's specific meaning of dependency. If file 
x.c has a "#include "defs"" line, then the object file x.o depends on defs; the source file x.c
does not. (If defs is changed, it is not necessary to do anything to the file x.c, while it is necessary
to recreate x.o.)

To discover what make would do, the "-n" option is very useful. The command

make -n

orders make to print out the commands it would issue without actually taking the time to execute
them. If a change to a file is absolutely certain to be benign (e.g., adding a new definition
to an include file), the "-t" (touch) option can save a lot of time: instead of issuing a large
number of superfluous recompilations, make updates the modification times on the affected file.
Thus, the command

make -ts

("touch silently") causes the relevant files to appear up to date. Obvious care is necessary, 
since this mode of operation subverts the intention of make and destroys all memory of the 
previous relationships. 

The debugging flag ("-d") causes make to print out a very detailed description of what it 
is doing, including the file times. The output is verbose, and recommended only as a last 
resort. 

Appendix. Suffixes and Transformation Rules

The make program itself does not know what file name suffixes are interesting or how to 
transform a file with one suffix into a file with another suffix. This information is stored in an 
internal table that has the form of a description file. If the "-r" flag is used, this table is not
used. 

The list of suffixes is actually the dependency list for the name ".SUFFIXES"; make 
looks for a file with any of the suffixes on the list. If such a file exists, and if there is a 
transformation rule for that combination, make acts as described earlier. The transformation 
rule names are the concatenation of the two suffixes. The name of the rule to transform a ".r"
file to a ".o" file is thus ".r.o". If the rule is present and no explicit command sequence has
been given in the user's description files, the command sequence for the rule ".r.o" is used. If
a command is generated by using one of these suffixing rules, the macro $* is given the value
of the stem (everything but the suffix) of the name of the file to be made, and the macro $< is
the name of the dependent that caused the action.

The order of the suffix list is significant, since it is scanned from left to right, and the first 
name that is formed that has both a file and a rule associated with it is used. If new names are 
to be appended, the user can just add an entry for ".SUFFIXES" in his own description file; 
the dependents will be added to the usual list. A ".SUFFIXES" line without any dependents 
deletes the current list. (It is necessary to clear the current list if the order of names is to be 
changed). 

The following is an excerpt from the default rules file: 

.SUFFIXES : .o .c .e .r .f .y .yr .ye .l .s
YACC=yacc
YACCR=yacc -r
YACCE=yacc -e
YFLAGS=
LEX=lex
LFLAGS=
CC=cc
AS=as -
CFLAGS=
RC=ec
RFLAGS=
EC=ec
EFLAGS=
FFLAGS=
.c.o :
	$(CC) $(CFLAGS) -c $<
.e.o .r.o .f.o :
	$(EC) $(RFLAGS) $(EFLAGS) $(FFLAGS) -c $<
.s.o :
	$(AS) -o $@ $<
.y.o :
	$(YACC) $(YFLAGS) $<
	$(CC) $(CFLAGS) -c y.tab.c
	rm y.tab.c
	mv y.tab.o $@
.y.c :
	$(YACC) $(YFLAGS) $<
	mv y.tab.c $@
1. INTRODUCTION

This paper describes how to write programs that interface with the UNIX operating system 
in a non-trivial way. This includes programs that use files by name, that use pipes, that invoke 
other commands as they run, or that attempt to catch interrupts and other signals during execu- 
tion. 

The document collects material which is scattered throughout several sections of The UNIX
Programmer's Manual [1] for Version 7 UNIX. There is no attempt to be complete; only generally
useful material is dealt with. It is assumed that you will be programming in C, so you
must be able to read the language roughly up to the level of The C Programming Language [2].
Some of the material in sections 2 through 4 is based on topics covered more carefully there. 
You should also be familiar with UNIX itself at least to the level of UNIX for Beginners [3]. 

2. BASICS 

2.1. Program Arguments 

When a C program is run as a command, the arguments on the command line are made 
available to the function main as an argument count argc and an array argv of pointers to 
character strings that contain the arguments. By convention, argv[0] is the command name 
itself, so argc is always greater than 0. 

The following program illustrates the mechanism: it simply echoes its arguments back to 
the terminal. (This is essentially the echo command.) 

main(argc, argv)	/* echo arguments */
int argc;
char *argv[];
{
	int i;

	for (i = 1; i < argc; i++)
		printf("%s%c", argv[i], (i<argc-1) ? ' ' : '\n');
}

argv is a pointer to an array whose individual elements are pointers to arrays of characters; 
each is terminated by \0, so they can be treated as strings. The program starts by printing 
argv[1] and loops until it has printed them all.

The argument count and the arguments are parameters to main. If you want to keep them 
around so other routines can get at them, you must copy them to external variables. 
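As a minimal sketch of that copying (written in present-day ANSI C rather than the older function syntax used in this paper; the names g_argc, g_argv, save_args and progname are invented for illustration), main can hand its parameters to a small routine that stores them in external variables:

```c
/* External copies of the arguments; the names are invented for this sketch. */
int    g_argc;
char **g_argv;

/* Remember the arguments so routines other than main can reach them. */
void save_args(int argc, char **argv)
{
	g_argc = argc;
	g_argv = argv;
}

/* Any routine can now ask for the command name, by convention argv[0]. */
const char *progname(void)
{
	return g_argc > 0 ? g_argv[0] : "?";
}
```

main would call save_args(argc, argv) once, before doing anything else; after that any routine can call progname() or look at g_argv directly.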

2.2. The "Standard Input" and "Standard Output" 

The simplest input mechanism is to read the "standard input," which is generally the 
user's terminal. The function getchar returns the next input character each time it is called. 
A file may be substituted for the terminal by using the < convention: if prog uses getchar,
then the command line

prog <file

causes prog to read file instead of the terminal. prog itself need know nothing about
where its input is coming from. This is also true if the input comes from another program via 
the pipe mechanism: 

otherprog | prog

provides the standard input for prog from the standard output of otherprog.

getchar returns the value EOF when it encounters the end of file (or an error) on what- 
ever you are reading. The value of EOF is normally defined to be -1, but it is unwise to take
any advantage of that knowledge. As will become clear shortly, this value is automatically 
defined for you when you compile a program, and need not be of any concern. 

Similarly, putchar(c) puts the character c on the "standard output," which is also by
default the terminal. The output can be captured on a file by using >: if prog uses putchar,

prog >outfile

writes the standard output on outfile instead of the terminal. outfile is created if it
doesn't exist; if it already exists, its previous contents are overwritten. And a pipe can be used:

prog | otherprog

puts the standard output of prog into the standard input of otherprog.

The function printf, which formats output in various ways, uses the same mechanism as
putchar does, so calls to printf and putchar may be intermixed in any order; the output
will appear in the order of the calls.

Similarly, the function scanf provides for formatted input conversion; it will read the
standard input and break it up into strings, numbers, etc., as desired. scanf uses the same
mechanism as getchar, so calls to them may also be intermixed.

Many programs read only one input and write one output; for such programs I/O with 
getchar, putchar, scanf, and printf may be entirely adequate, and it is almost always 
enough to get started. This is particularly true if the UNIX pipe facility is used to connect the 
output of one program to the input of the next. For example, the following program strips out 
all ASCII control characters from its input (except for newline and tab).

#include <stdio.h>

main()	/* ccstrip: strip non-graphic characters */
{
	int c;

	while ((c = getchar()) != EOF)
		if ((c >= ' ' && c < 0177) || c == '\t' || c == '\n')
			putchar(c);
	exit(0);
}

The line 

#include <stdio.h> 

should appear at the beginning of each source file. It causes the C compiler to read a file
(/usr/include/stdio.h) of standard routines and symbols that includes the definition of EOF.

If it is necessary to treat multiple files, you can use cat to collect the files for you: 

cat file1 file2 ... | ccstrip >output

and thus avoid learning how to access files from a program. By the way, the call to exit at the 
end is not necessary to make the program work properly, but it assures that any caller of the 
program will see a normal termination status (conventionally 0) from the program when it
completes. Section 6 discusses status returns in more detail.

3. THE STANDARD I/O LIBRARY

The "Standard I/O Library" is a collection of routines intended to provide efficient and 
portable I/O services for most C programs. The standard I/O library is available on each sys- 
tem that supports C, so programs that confine their system interactions to its facilities can be 
transported from one system to another essentially without change. 

In this section, we will discuss the basics of the standard I/O library. The appendix con- 
tains a more complete description of its capabilities. 

3.1. File Access 

The programs written so far have all read the standard input and written the standard out- 
put, which we have assumed are magically pre-defined. The next step is to write a program that 
accesses a file that is not already connected to the program. One simple example is wc, which 
counts the lines, words and characters in a set of files. For instance, the command 

wc x.c y.c 

prints the number of lines, words and characters in x.c and y.c and the totals. 

The question is how to arrange for the named files to be read — that is, how to connect the 
file system names to the I/O statements which actually read the data. 

The rules are simple. Before it can be read or written a file has to be opened by the stan- 
dard library function fopen. fopen takes an external name (like x.c or y.c), does some 
housekeeping and negotiation with the operating system, and returns an internal name which 
must be used in subsequent reads or writes of the file. 

This internal name is actually a pointer, called a file pointer, to a structure which contains 
information about the file, such as the location of a buffer, the current character position in the 
buffer, whether the file is being read or written, and the like. Users don't need to know the
details, because part of the standard I/O definitions obtained by including stdio.h is a
structure definition called FILE. The only declaration needed for a file pointer is exemplified by

FILE *fp, *fopen();

This says that fp is a pointer to a FILE, and fopen returns a pointer to a FILE. (FILE is a
type name, like int, not a structure tag.)

The actual call to fopen in a program is 

fp = fopen(name, mode);

The first argument of fopen is the name of the file, as a character string. The second argu- 
ment is the mode, also as a character string, which indicates how you intend to use the file. 
The only allowable modes are read ("r"), write ("w"), or append ("a"). 

If a file that you open for writing or appending does not exist, it is created (if possible). 
Opening an existing file for writing causes the old contents to be discarded. Trying to read a 
file that does not exist is an error, and there may be other causes of error as well (like trying to 
read a file when you don't have permission). If there is any error, fopen will return the null 
pointer value NULL (which is defined as zero in stdio.h).

The next thing needed is a way to read or write the file once it is open. There are several
possibilities, of which getc and putc are the simplest. getc returns the next character from
a file; it needs the file pointer to tell it what file. Thus

c = getc(fp)

places in c the next character from the file referred to by fp; it returns EOF when it reaches
end of file. putc is the inverse of getc:

putc(c, fp)

puts the character c on the file fp and returns c. getc and putc return EOF on error.

When a program is started, three files are opened automatically, and file pointers are pro-
vided for them. These files are the standard input, the standard output, and the standard error 
output; the corresponding file pointers are called stdin, stdout, and stderr. Normally 
these are all connected to the terminal, but may be redirected to files or pipes as described in 
Section 2.2. stdin, stdout and stderr are pre-defined in the I/O library as the standard 
input, output and error files; they may be used anywhere an object of type FILE * can be. 
They are constants, however, not variables, so don't try to assign to them. 

With some of the preliminaries out of the way, we can now write wc. The basic design is 
one that has been found convenient for many programs: if there are command-line arguments, 
they are processed in order. If there are no arguments, the standard input is processed. This 
way the program can be used stand-alone or as part of a larger process. 

#include <stdio.h>

main(argc, argv)	/* wc: count lines, words, chars */
int argc;
char *argv[];
{
	int c, i, inword;
	FILE *fp, *fopen();
	long linect, wordct, charct;
	long tlinect = 0, twordct = 0, tcharct = 0;

	i = 1;
	fp = stdin;
	do {
		if (argc > 1 && (fp=fopen(argv[i], "r")) == NULL) {
			fprintf(stderr, "wc: can't open %s\n", argv[i]);
			continue;
		}
		linect = wordct = charct = inword = 0;
		while ((c = getc(fp)) != EOF) {
			charct++;
			if (c == '\n')
				linect++;
			if (c == ' ' || c == '\t' || c == '\n')
				inword = 0;
			else if (inword == 0) {
				inword = 1;
				wordct++;
			}
		}
		printf("%7ld %7ld %7ld", linect, wordct, charct);
		printf(argc > 1 ? " %s\n" : "\n", argv[i]);
		fclose(fp);
		tlinect += linect;
		twordct += wordct;
		tcharct += charct;
	} while (++i < argc);
	if (argc > 2)
		printf("%7ld %7ld %7ld total\n", tlinect, twordct, tcharct);
	exit(0);
}

The function fprintf is identical to printf, save that the first argument is a file pointer that
specifies the file to be written.

The function fclose is the inverse of fopen; it breaks the connection between the file
pointer and the external name that was established by fopen, freeing the file pointer for
another file. Since there is a limit on the number of files that a program may have open
simultaneously, it's a good idea to free things when they are no longer needed. There is also another
reason to call fclose on an output file: it flushes the buffer in which putc is collecting
output. (fclose is called automatically for each open file when a program terminates normally.)
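The pieces introduced so far (fopen, getc, putc and fclose) are enough to copy one named file to another. The following is only a sketch, written in present-day ANSI C; the function name filecopy and its return convention are assumptions made for the example, not part of the library.

```c
#include <stdio.h>

/* Copy the file named `from` to the file named `to`, a character at a time.
   Returns 0 on success, -1 if either file cannot be opened. */
int filecopy(const char *from, const char *to)
{
	FILE *in, *out;
	int c;	/* int, not char, so that EOF can be told apart from data */

	if ((in = fopen(from, "r")) == NULL)
		return -1;
	if ((out = fopen(to, "w")) == NULL) {
		fclose(in);	/* give the file pointer back before leaving */
		return -1;
	}
	while ((c = getc(in)) != EOF)
		putc(c, out);
	fclose(in);
	fclose(out);	/* flushes whatever putc has buffered */
	return 0;
}
```

Because both files are closed before returning, a program can call filecopy any number of times without running into the limit on open files.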

3.2. Error Handling — Stderr and Exit 

stderr is assigned to a program in the same way that stdin and stdout are. Output 
written on stderr appears on the user's terminal even if the standard output is redirected. wc
writes its diagnostics on stderr instead of stdout so that if one of the files can't be accessed 
for some reason, the message finds its way to the user's terminal instead of disappearing down 
a pipeline or into an output file. 

The program actually signals errors in another way, using the function exit to terminate 
program execution. The argument of exit is available to whatever process called it (see Sec- 
tion 6), so the success or failure of the program can be tested by another program that uses this 
one as a sub-process. By convention, a return value of 0 signals that all is well; non-zero 
values signal abnormal situations. 

exit itself calls fclose for each open output file, to flush out any buffered output, then 
calls a routine named _exit. The function _exit causes immediate termination without any
buffer flushing; it may be called directly if desired. 

3.3. Miscellaneous I/O Functions 

The standard I/O library provides several other I/O functions besides those we have illus- 
trated above. 

Normally output with putc, etc., is buffered (except to stderr); to force it out
immediately, use fflush(fp).

fscanf is identical to scanf, except that its first argument is a file pointer (as with
fprintf) that specifies the file from which the input comes; it returns EOF at end of file. 

The functions sscanf and sprintf are identical to fscanf and fprintf, except that 
the first argument names a character string instead of a file pointer. The conversion is done 
from the string for sscanf and into it for sprintf. 

fgets(buf, size, fp) copies the next line from fp, up to and including a newline,
into buf; at most size-1 characters are copied; it returns NULL at end of file.
fputs(buf, fp) writes the string in buf onto file fp.

The function ungetc(c, fp) "pushes back" the character c onto the input stream fp; a
subsequent call to getc, fscanf, etc., will encounter c. Only one character of pushback per
file is permitted.
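Two small routines may make these functions more concrete. Both are sketches in present-day ANSI C, and the names peekc and copylines are invented for the example: the first uses ungetc to look one character ahead, the second copies a file a line at a time with fgets and fputs.

```c
#include <stdio.h>

/* Look at the next character of fp without consuming it. */
int peekc(FILE *fp)
{
	int c = getc(fp);

	if (c != EOF)
		ungetc(c, fp);	/* one character of pushback is all we need */
	return c;
}

/* Copy fp to out a piece at a time and return the number of pieces read.
   Each piece is one line, provided no line is longer than the buffer. */
long copylines(FILE *fp, FILE *out)
{
	char buf[256];
	long n = 0;

	while (fgets(buf, sizeof buf, fp) != NULL) {
		fputs(buf, out);
		n++;
	}
	return n;
}
```

A line longer than the buffer is still copied correctly; it simply arrives in more than one piece, so the count is then larger than the number of lines.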

4. LOW-LEVEL I/O 

This section describes the bottom level of I/O on the UNIX system. The lowest level of 
I/O in UNIX provides no buffering or any other services; it is in fact a direct entry into the 
operating system. You are entirely on your own, but on the other hand, you have the most 
control over what happens. And since the calls and usage are quite simple, this isn't as bad as 
it sounds. 

4.1. File Descriptors 

In the UNIX operating system, all input and output is done by reading or writing files, 
because all peripheral devices, even the user's terminal, are files in the file system. This means 
that a single, homogeneous interface handles all communication between a program and peri- 
pheral devices. 
In the most general case, before reading or writing a file, it is necessary to inform the
system of your intent to do so, a process called "opening" the file. If you are going to write on a
file, it may also be necessary to create it. The system checks your right to do so (Does the file
exist? Do you have permission to access it?), and if all is well, returns a small positive integer
called a file descriptor. Whenever I/O is to be done on the file, the file descriptor is used instead
of the name to identify the file. (This is roughly analogous to the use of READ(5,...) and
WRITE(6,...) in Fortran.) All information about an open file is maintained by the system; the
user program refers to the file only by the file descriptor.

The file pointers discussed in section 3 are similar in spirit to file descriptors, but file 
descriptors are more fundamental. A file pointer is a pointer to a structure that contains,
among other things, the file descriptor for the file in question. 

Since input and output involving the user's terminal are so common, special arrangements 
exist to make this convenient. When the command interpreter (the "shell") runs a program, it 
opens three files, with file descriptors 0, 1, and 2, called the standard input, the standard out- 
put, and the standard error output. All of these are normally connected to the terminal, so if a 
program reads file descriptor 0 and writes file descriptors 1 and 2, it can do terminal I/O 
without worrying about opening the files. 

If I/O is redirected to and from files with < and >, as in 

prog <infile >outfile

the shell changes the default assignments for file descriptors 0 and 1 from the terminal to the 
named files. Similar observations hold if the input or output is associated with a pipe. Nor- 
mally file descriptor 2 remains attached to the terminal, so error messages can go there. In all 
cases, the file assignments are changed by the shell, not by the program. The program does not 
need to know where its input comes from nor where its output goes, so long as it uses file 0 for 
input and 1 and 2 for output. 

4.2. Read and Write 

All input and output is done by two functions called read and write. For both, the first 
argument is a file descriptor. The second argument is a buffer in your program where the data 
is to come from or go to. The third argument is the number of bytes to be transferred. The 
calls are 

n_read = read(fd, buf, n);

n_written = write(fd, buf, n);

Each call returns a byte count which is the number of bytes actually transferred. On reading, 
the number of bytes returned may be less than the number asked for, because fewer than n 
bytes remained to be read. (When the file is a terminal, read normally reads only up to the 
next newline, which is generally less than what was requested.) A return value of zero bytes 
implies end of file, and -1 indicates an error of some sort. For writing, the returned value is 
the number of bytes actually written; it is generally an error if this isn't equal to the number 
supposed to be written. 

The number of bytes to be read or written is quite arbitrary. The two most common values 
are 1, which means one character at a time ("unbuffered"), and 512, which corresponds to a 
physical blocksize on many peripheral devices. This latter size will be most efficient, but even 
character at a time I/O is not inordinately expensive. 

Putting these facts together, we can write a simple program to copy its input to its output.
This program will copy anything to anything, since the input and output can be redirected to 
any file or device. 



#define	BUFSIZE	512	/* best size for PDP-11 UNIX */

main()	/* copy input to output */
{
	char buf[BUFSIZE];
	int n;

	while ((n = read(0, buf, BUFSIZE)) > 0)
		write(1, buf, n);
	exit(0);
}

If the file size is not a multiple of BUFSIZE, some read will return a smaller number of bytes 
to be written by write; the next call to read after that will return zero. 
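The copy loop can also be packaged as a routine that checks the values returned by read and write, following the rules given above. This is a sketch in present-day ANSI C; the name copyfd and its return convention are assumptions made for the example.

```c
#include <unistd.h>

#define BUFSIZE 512

/* Copy everything readable on descriptor `from` to descriptor `to`.
   Returns the number of bytes copied, or -1 on a read error or short write. */
long copyfd(int from, int to)
{
	char buf[BUFSIZE];
	long total = 0;
	int n;

	while ((n = read(from, buf, BUFSIZE)) > 0) {
		if (write(to, buf, n) != n)
			return -1;	/* fewer bytes written than asked for */
		total += n;
	}
	return n < 0 ? -1 : total;	/* n is 0 at end of file, -1 on error */
}
```

Calling copyfd(0, 1) does the work of the program above, with the difference that a failed write is reported to the caller instead of being ignored.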

It is instructive to see how read and write can be used to construct higher level routines 
like getchar, putchar, etc. For example, here is a version of getchar which does 
unbuffered input. 

#define	CMASK	0377	/* for making char's > 0 */

getchar()	/* unbuffered single character input */
{
	char c;

	return((read(0, &c, 1) > 0) ? c & CMASK : EOF);
}

c must be declared char, because read accepts a character pointer. The character being 
returned must be masked with 0377 to ensure that it is positive; otherwise sign extension may 
make it negative. (The constant 0377 is appropriate for the PDP-11 but not necessarily for 
other machines.) 

The second version of getchar does input in big chunks, and hands out the characters 
one at a time. 

#define	CMASK	0377	/* for making char's > 0 */
#define	BUFSIZE	512

getchar()	/* buffered version */
{
	static char	buf[BUFSIZE];
	static char	*bufp = buf;
	static int	n = 0;

	if (n == 0) {	/* buffer is empty */
		n = read(0, buf, BUFSIZE);
		bufp = buf;
	}
	return((--n >= 0) ? *bufp++ & CMASK : EOF);
}

4.3. Open, Creat, Close, Unlink 

Other than the default standard input, output and error files, you must explicitly open files 
in order to read or write them. There are two system entry points for this, open and creat
[sic].

open is rather like the fopen discussed in the previous section, except that instead of
returning a file pointer, it returns a file descriptor, which is just an int.

int fd;

fd = open(name, rwmode);

As with fopen, the name argument is a character string corresponding to the external file
name. The access mode argument is different, however: rwmode is 0 for read, 1 for write, and
2 for read and write access. open returns -1 if any error occurs; otherwise it returns a valid
file descriptor. 

It is an error to try to open a file that does not exist. The entry point creat is provided 
to create new files, or to re-write old ones. 

fd = creat(name, pmode);

returns a file descriptor if it was able to create the file called name, and -1 if not. If the file 
already exists, creat will truncate it to zero length; it is not an error to creat a file that 
already exists. 

If the file is brand new, creat creates it with the protection mode specified by the pmode 
argument. In the UNIX file system, there are nine bits of protection information associated 
with a file, controlling read, write and execute permission for the owner of the file, for the 
owner's group, and for all others. Thus a three-digit octal number is most convenient for 
specifying the permissions. For example, 0755 specifies read, write and execute permission for 
the owner, and read and execute permission for the group and everyone else. 
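To see how an octal mode maps onto the nine protection bits, the following sketch spells a mode out in letters. It is written in present-day ANSI C, and the routine name modestr is invented for the example; only the bit layout comes from the description above.

```c
/* Render the nine protection bits of pmode as a string such as "rwxr-xr-x":
   owner, group, others, each as read, write, execute.
   out must have room for 10 characters. */
void modestr(int pmode, char *out)
{
	static const char letters[] = "rwxrwxrwx";
	int i;

	for (i = 0; i < 9; i++)	/* 0400 is the owner's read bit, 01 is others' execute */
		out[i] = (pmode & (0400 >> i)) ? letters[i] : '-';
	out[9] = '\0';
}
```

Given 0755 the routine produces "rwxr-xr-x", and given 0644 it produces "rw-r--r--".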

To illustrate, here is a simplified version of the UNIX utility cp, a program which copies one 
file to another. (The main simplification is that our version copies only one file, and does not 
permit the second argument to be a directory.) 

#define NULL 0
#define BUFSIZE 512
#define PMODE 0644 /* RW for owner, R for group, others */

main(argc, argv)	/* cp: copy f1 to f2 */
int argc;
char *argv[];
{
	int f1, f2, n;
	char buf[BUFSIZE];

	if (argc != 3)
		error("Usage: cp from to", NULL);
	if ((f1 = open(argv[1], 0)) == -1)
		error("cp: can't open %s", argv[1]);
	if ((f2 = creat(argv[2], PMODE)) == -1)
		error("cp: can't create %s", argv[2]);

	while ((n = read(f1, buf, BUFSIZE)) > 0)
		if (write(f2, buf, n) != n)
			error("cp: write error", NULL);
	exit(0);
}

error(s1, s2)	/* print error message and die */
char *s1, *s2;
{
	printf(s1, s2);
	printf("\n");
	exit(1);
}
As we said earlier, there is a limit (typically 15-25) on the number of files which a program
may have open simultaneously. Accordingly, any program which intends to process many files 
must be prepared to re-use file descriptors. The routine close breaks the connection between 
a file descriptor and an open file, and frees the file descriptor for use with some other file. Ter- 
mination of a program via exit or return from the main program closes all open files. 

The function unlink (filename) removes the file filename from the file system. 

4.4. Random Access — Seek and Lseek 

File I/O is normally sequential: each read or write takes place at a position in the file 
right after the previous one. When necessary, however, a file can be read or written in any 
arbitrary order. The system call lseek provides a way to move around in a file without actu- 
ally reading or writing: 

lseek (fd, offset, origin); 

forces the current position in the file whose descriptor is fd to move to position offset, 
which is taken relative to the location specified by origin. Subsequent reading or writing will 
begin at that position. offset is a long; fd and origin are int's. origin can be 0, 1,
or 2 to specify that offset is to be measured from the beginning, from the current position, 
or from the end of the file respectively. For example, to append to a file, seek to the end 
before writing: 

lseek (fd, 0L, 2); 
To get back to the beginning ("rewind"), 

lseek (fd, 0L, 0); 
Notice the 0L argument; it could also be written as (long) 0. 

With lseek, it is possible to treat files more or less like large arrays, at the price of slower 
access. For example, the following simple function reads any number of bytes from any arbi- 
trary place in a file. 

get(fd, pos, buf, n)	/* read n bytes from position pos */
int fd, n;
long pos;
char *buf;
{
	lseek(fd, pos, 0);	/* get to pos */
	return(read(fd, buf, n));
}

In pre-version 7 UNIX, the basic entry point to the I/O system is called seek. seek is
identical to lseek, except that its offset argument is an int rather than a long. Accord- 
ingly, since PDP-11 integers have only 16 bits, the offset specified for seek is limited to 
65,535; for this reason, origin values of 3, 4, 5 cause seek to multiply the given offset by 
512 (the number of bytes in one physical block) and then interpret origin as if it were 0, 1, 
or 2 respectively. Thus to get to an arbitrary place in a large file requires two seeks, first one 
which selects the block, then one which has origin equal to 1 and moves to the desired byte 
within the block. 

4.5. Error Processing 

The routines discussed in this section, and in fact all the routines which are direct entries
into the system, can incur errors. Usually they indicate an error by returning a value of -1.
Sometimes it is nice to know what sort of error occurred; for this purpose all these routines,
when appropriate, leave an error number in the external cell errno. The meanings of the
various error numbers are listed in the introduction to Section II of the UNIX Programmer's
Manual, so your program can, for example, determine if an attempt to open a file failed
because it did not exist or because the user lacked permission to read it. Perhaps more
commonly, you may want to print out the reason for failure. The routine perror will print a
message associated with the value of errno; more generally, sys_errlist is an array of character
strings which can be indexed by errno and printed by your program.
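A sketch of both styles of error report follows, in present-day ANSI C. The routine name open_or_explain is invented for the example, and the symbolic names ENOENT and EACCES are taken from the header errno.h as found on current systems; the text above speaks only of error numbers.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Try to open name for reading; on failure say why and return -1. */
int open_or_explain(const char *name)
{
	int fd = open(name, 0);	/* 0 means read, as in section 4.3 */

	if (fd == -1) {
		if (errno == ENOENT)		/* the file does not exist */
			fprintf(stderr, "%s: no such file\n", name);
		else if (errno == EACCES)	/* it exists but may not be read */
			fprintf(stderr, "%s: permission denied\n", name);
		else
			perror(name);		/* let the library pick the message */
	}
	return fd;
}
```

Note that errno is examined immediately after the failing call; a later call may overwrite it.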

5. PROCESSES 

It is often easier to use a program written by someone else than to invent one's own. This 
section describes how to execute a program from within another. 

5.1. The "System" Function

The easiest way to execute a program from another is to use the standard library routine 
system. system takes one argument, a command string exactly as typed at the terminal
(except for the newline at the end) and executes it. For instance, to time-stamp the output of 
a program, 

main()
{
	system("date");
	/* rest of processing */
}

If the command string has to be built from pieces, the in-memory formatting capabilities of 
sprintf may be useful. 
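For instance, a command and a single argument might be joined as follows. This is a sketch in present-day ANSI C; the routine name run_with_arg and the buffer size are assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "prog arg" in memory and hand it to the shell via system.
   Returns whatever system returns, or -1 if the pieces would not fit. */
int run_with_arg(const char *prog, const char *arg)
{
	char cmd[256];

	/* room is needed for both strings, the blank between them, and the \0 */
	if (strlen(prog) + strlen(arg) + 2 > sizeof cmd)
		return -1;
	sprintf(cmd, "%s %s", prog, arg);
	return system(cmd);
}
```

Because the string is interpreted exactly as if it had been typed at the terminal, an argument containing blanks or characters such as * and > will be treated specially by the shell; the sketch makes no attempt to guard against that.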

Remember that getc and putc normally buffer their input and output; terminal I/O will not be
properly synchronized unless this buffering is defeated. For output, use fflush; for input, see
setbuf in the appendix.

5.2. Low-Level Process Creation — Execl and Execv 

If you're not using the standard library, or if you need finer control over what happens, you 
will have to construct calls to other programs using the more primitive routines that the stan- 
dard library's system routine is based on. 

The most basic operation is to execute another program without returning, by using the rou- 
tine execl. To print the date as the last action of a running program, use 

execl("/bin/date", "date", NULL);

The first argument to execl is the file name of the command; you have to know where it is 
found in the file system. The second argument is conventionally the program name (that is, 
the last component of the file name), but this is seldom used except as a place-holder. If the 
command takes arguments, they are strung out after this; the end of the list is marked by a 
NULL argument. 

The execl call overlays the existing program with the new one, runs that, then exits. 
There is no return to the original program. 

More realistically, a program might fall into two or more phases that communicate only 
through temporary files. Here it is natural to make the second pass simply an execl call from 
the first. 

The one exception to the rule that the original program never gets control back occurs 
when there is an error, for example if the file can't be found or is not executable. If you don't 
know where date is located, say 

execl("/bin/date", "date", NULL);
execl("/usr/bin/date", "date", NULL);
fprintf(stderr, "Someone stole 'date'\n");

A variant of execl called execv is useful when you don't know in advance how many 
arguments there are going to be. The call is 
execv(filename, argp);

where argp is an array of pointers to the arguments; the last pointer in the array must be
NULL so execv can tell where the list ends. As with execl, filename is the file in which
the program is found, and argp[0] is the name of the program. (This arrangement is
identical to the argv array for program arguments.)

Neither of these routines provides the niceties of normal command execution. There is no 
automatic search of multiple directories — you have to know precisely where the command is 
located. Nor do you get the expansion of metacharacters like <, >, *, ?, and [] in the
argument list. If you want these, use execl to invoke the shell sh, which then does all the work.
Construct a string commandline that contains the complete command as it would have been
typed at the terminal, then say

execl("/bin/sh", "sh", "-c", commandline, NULL);

The shell is assumed to be at a fixed place, /bin/sh. Its argument -c says to treat the next 
argument as a whole command line, so it does just what you want. The only problem is in con- 
structing the right information in commandline. 

5.3. Control of Processes — Fork and Wait 

So far what we've talked about isn't really all that useful by itself. Now we will show how 
to regain control after running a program with execl or execv. Since these routines simply 
overlay the new program on the old one, to save the old one requires that it first be split into 
two copies; one of these can be overlaid, while the other waits for the new, overlaying program 
to finish. The splitting is done by a routine called fork: 

proc.id * fork ( ) ; 

splits the program into two copies, both of which continue to run. The only difference between 
the two is the value of proc_id, the "process id." In one of these processes (the "child"),
proc_id is zero. In the other (the "parent"), proc_id is non-zero; it is the process number
of the child. Thus the basic way to call, and return from, another program is 

if (fork() == 0)
    execl("/bin/sh", "sh", "-c", cmd, NULL);    /* in child */

And in fact, except for handling errors, this is sufficient. The fork makes two copies of the 
program. In the child, the value returned by fork is zero, so it calls execl which does the 
command and then dies. In the parent, fork returns non-zero so it skips the execl. (If
there is any error, fork returns -1.)

More often, the parent wants to wait for the child to terminate before continuing itself. 
This can be done with the function wait: 

int status;

if (fork() == 0)
    execl(...);
wait(&status);

This still doesn't handle any abnormal conditions, such as a failure of the execl or fork, or 
the possibility that there might be more than one child running simultaneously. (The wait 
returns the process id of the terminated child, if you want to check it against the value returned 
by fork.) Finally, this fragment doesn't deal with any funny behavior on the part of the child 
(which is reported in status). Still, these three lines are the heart of the standard library's 
system routine, which we'll show in a moment. 
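
To make the missing error handling concrete, here is one way the three lines might be filled out. It is only a sketch: the name run_command and its return convention are our own, and a production routine would also deal with signals, as the system routine does.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run cmd through the shell and store the raw wait status in *status.
   Returns 0 on success, -1 if fork or wait failed. */
int run_command(const char *cmd, int *status)
{
	pid_t child, w;

	child = fork();
	if (child == -1)
		return -1;		/* fork failed: no child exists */
	if (child == 0) {
		execl("/bin/sh", "sh", "-c", cmd, (char *)0);
		_exit(127);		/* execl returned, so it failed */
	}
	/* Some other child may finish first; keep waiting for ours. */
	while ((w = wait(status)) != child && w != -1)
		;
	return w == -1 ? -1 : 0;
}
```

The loop compares the value returned by wait with the value returned by fork, which is what allows several children to be outstanding at once.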

The status returned by wait encodes in its low-order eight bits the system's idea of the 
child's termination status: it is 0 for normal termination and non-zero to indicate various kinds 
of problems. The next higher eight bits are taken from the argument of the call to exit which 
caused a normal termination of the child process. It is good coding practice for all programs to
return meaningful status.
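
The layout of the status word can be checked with a small experiment. The sketch below forks a child that exits with a chosen code and then picks the status apart by hand. The function names are invented for illustration. The shifts assume the traditional layout described above; current systems supply the macros WIFEXITED and WEXITSTATUS in <sys/wait.h>, which are the safer way to get the same information.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that calls _exit(code) and return the raw wait status,
   or -1 if fork or wait failed. */
int status_of_exit(int code)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return status;
}

/* Low-order eight bits: the system's idea of how the child ended. */
int low_byte(int status)
{
	return status & 0377;
}

/* Next higher eight bits: the argument the child gave to exit. */
int high_byte(int status)
{
	return (status >> 8) & 0377;
}
```

For a child that called _exit(5), low_byte should give 0 and high_byte should give 5 on a system that uses the traditional layout.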

When a program is called by the shell, the three file descriptors 0, 1, and 2 are set up pointing
at the right files, and all other possible file descriptors are available for use. When this program
calls another one, correct etiquette suggests making sure the same conditions hold. Neither
fork nor the exec calls affects open files in any way. If the parent is buffering output
that must come out before output from the child, the parent must flush its buffers before the 
execl. Conversely, if a caller buffers an input stream, the called program will lose any infor- 
mation that has been read by the caller. 

5.4. Pipes

A pipe is an I/O channel intended for use between two cooperating processes: one process 
writes into the pipe, while the other reads. The system looks after buffering the data and syn- 
chronizing the two processes. Most pipes are created by the shell, as in 

ls | pr

which connects the standard output of ls to the standard input of pr. Sometimes, however, it
is most convenient for a process to set up its own plumbing; in this section, we will illustrate 
how the pipe connection is established and used. 

The system call pipe creates a pipe. Since a pipe is used for both reading and writing, two 
file descriptors are returned; the actual usage is like this: 

int fd[2];

stat = pipe(fd);
if (stat == -1)
    /* there was an error ... */

fd is an array of two file descriptors, where fd[0] is the read side of the pipe and fd[1] is
for writing. These may be used in read, write and close calls just like any other file 
descriptors. 

If a process reads a pipe which is empty, it will wait until data arrives; if a process writes 
into a pipe which is too full, it will wait until the pipe empties somewhat. If the write side of 
the pipe is closed, a subsequent read will encounter end of file. 
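
These rules can be seen at work inside a single process. The following sketch (the name pipe_round_trip is ours) writes a short message into a pipe, closes the write side, and reads until end of file. It assumes the message is small enough to fit in the pipe's buffer; a longer one would make the write wait forever, since nobody is reading yet.

```c
#include <string.h>
#include <unistd.h>

/* Send msg through a pipe within one process and read it back into buf.
   Returns the number of bytes read before end of file, or -1 on error.
   msg must be short enough to fit in the pipe's buffer. */
int pipe_round_trip(const char *msg, char *buf, int size)
{
	int fd[2], n, total = 0;
	int len = (int)strlen(msg);

	if (pipe(fd) == -1)
		return -1;
	if (write(fd[1], msg, len) != len) {	/* fd[1] is the write side */
		close(fd[0]);
		close(fd[1]);
		return -1;
	}
	close(fd[1]);	/* no writers left, so the reads can reach end of file */
	while (total < size && (n = read(fd[0], buf + total, size - total)) > 0)
		total += n;
	close(fd[0]);
	return total;
}
```

Without the close of fd[1], the final read would wait for data that can never arrive, because the process itself still counts as a potential writer.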

To illustrate the use of pipes in a realistic setting, let us write a function called 
popen(cmd, mode), which creates a process cmd (just as system does), and returns a file
descriptor that will either read or write that process, according to mode. That is, the call

fout = popen("pr", WRITE);

creates a process that executes the pr command; subsequent write calls using the file descrip- 
tor fout will send their data to that process through the pipe. 

popen first creates the pipe with a pipe system call; it then forks to create two
copies of itself. The child decides whether it is supposed to read or write, closes the other side 
of the pipe, then calls the shell (via execl) to run the desired process. The parent likewise 
closes the end of the pipe it does not use. These closes are necessary to make end-of-file tests 
work properly. For example, if a child that intends to read fails to close the write end of the 
pipe, it will never see the end of the pipe file, just because there is one writer potentially active.

#include <stdio.h>

#define READ 0
#define WRITE 1
#define tst(a, b) (mode == READ ? (b) : (a))
static int popen_pid;

popen(cmd, mode)
char *cmd;
int mode;
{
    int p[2];

    if (pipe(p) < 0)
        return(NULL);
    if ((popen_pid = fork()) == 0) {
        close(tst(p[WRITE], p[READ]));
        close(tst(0, 1));
        dup(tst(p[READ], p[WRITE]));
        close(tst(p[READ], p[WRITE]));
        execl("/bin/sh", "sh", "-c", cmd, 0);
        _exit(1);    /* disaster has occurred if we get here */
    }
    if (popen_pid == -1)
        return(NULL);
    close(tst(p[READ], p[WRITE]));
    return(tst(p[WRITE], p[READ]));
}

The sequence of closes in the child is a bit tricky. Suppose that the task is to create a child 
process that will read data from the parent. Then the first close closes the write side of the 
pipe, leaving the read side open. The lines 

close(tst(0, 1));
dup(tst(p[READ], p[WRITE]));

are the conventional way to associate the pipe descriptor with the standard input of the child. 
The close closes file descriptor 0, that is, the standard input. dup is a system call that returns
a duplicate of an already open file descriptor. File descriptors are assigned in increasing order
and the first available one is returned, so the effect of the dup is to copy the file descriptor for
the pipe (read side) to file descriptor 0; thus the read side of the pipe becomes the standard
input. (Yes, this is a bit tricky, but it's a standard idiom.) Finally, the old read side of the pipe 
is closed. 

A similar sequence of operations takes place when the child process is supposed to write
to the parent instead of reading from it. You may find it a useful exercise to step through that case.
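
For readers who would rather see that case than step through it, here is a sketch with the tst macro expanded by hand for a child that writes to the parent. The name read_from_command and its buffer-filling interface are invented for this illustration, and the check on the value returned by dup guards against the unusual situation in which descriptor 0 was not open to begin with.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run cmd with its standard output connected to a pipe and collect up to
   size-1 bytes of that output in buf. Returns the byte count, or -1. */
int read_from_command(const char *cmd, char *buf, int size)
{
	int p[2], n, total = 0;
	pid_t pid;

	if (size < 1 || pipe(p) == -1)
		return -1;
	if ((pid = fork()) == -1) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	if (pid == 0) {
		close(p[0]);		/* child does not read the pipe */
		close(1);		/* free descriptor 1, the standard output */
		if (dup(p[1]) != 1)	/* lowest free descriptor should be 1 */
			_exit(126);
		close(p[1]);		/* old write side no longer needed */
		execl("/bin/sh", "sh", "-c", cmd, (char *)0);
		_exit(127);
	}
	close(p[1]);	/* parent must not keep the write side open */
	while (total < size - 1 &&
	    (n = read(p[0], buf + total, size - 1 - total)) > 0)
		total += n;
	buf[total] = '\0';
	close(p[0]);
	waitpid(pid, (int *)0, 0);	/* lay the child to rest */
	return total;
}
```

The parent's close of p[1] matters for the same reason as before: if it were left open, the read loop would never see end of file after the child finished.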

The job is not quite done, for we still need a function pclose to close the pipe created by 
popen. The main reason for using a separate function rather than close is that it is desirable 
to wait for the termination of the child process. First, the return value from pclose indicates 
whether the process succeeded. Equally important when a process creates several children is 
that only a bounded number of unwaited-for children can exist, even if some of them have ter- 
minated: performing the wait lays the child to rest. Thus: 



#include <signal.h>

pclose(fd)    /* close pipe fd */
int fd;
{
    register r, (*hstat)(), (*istat)(), (*qstat)();
    int status;
    extern int popen_pid;

    close(fd);
    istat = signal(SIGINT, SIG_IGN);
    qstat = signal(SIGQUIT, SIG_IGN);
    hstat = signal(SIGHUP, SIG_IGN);
    while ((r = wait(&status)) != popen_pid && r != -1);
    if (r == -1)
        status = -1;
    signal(SIGINT, istat);
    signal(SIGQUIT, qstat);
    signal(SIGHUP, hstat);
    return(status);
}

The calls to signal make sure that no interrupts, etc., interfere with the waiting process; this 
is the topic of the next section. 

The routine as written has the limitation that only one pipe may be open at once, because 
of the single shared variable popen_pid; it really should be an array indexed by file descriptor.
A popen function, with slightly different arguments and return value, is available as part
of the standard I/O library discussed below. As currently written, it shares the same limitation.
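
The array suggested above might look something like the following sketch. The names child_of, remember_child, forget_child, and MAXFD are all hypothetical; popen would call remember_child with the descriptor it is about to return, and pclose would call forget_child to learn which process to wait for.

```c
#define MAXFD 64	/* illustrative table size, not from the text */

static int child_of[MAXFD];	/* child_of[fd] is the child's pid, 0 if none */

/* Record that descriptor fd leads to child pid.
   Returns 0, or -1 if fd is out of range. */
int remember_child(int fd, int pid)
{
	if (fd < 0 || fd >= MAXFD)
		return -1;
	child_of[fd] = pid;
	return 0;
}

/* Return and clear the pid recorded for fd, or -1 if none was recorded. */
int forget_child(int fd)
{
	int pid;

	if (fd < 0 || fd >= MAXFD || child_of[fd] == 0)
		return -1;
	pid = child_of[fd];
	child_of[fd] = 0;
	return pid;
}
```

Zero serves as the "no child" mark because fork never gives the parent a child whose process id is zero.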

6. SIGNALS - INTERRUPTS AND ALL THAT 

This section is concerned with how to deal gracefully with signals from the outside world 
(like interrupts), and with program faults. Since there's nothing very useful that can be done 
from within C about program faults, which arise mainly from illegal memory references or from 
execution of peculiar instructions, we'll discuss only the outside-world signals: interrupt, which 
is sent when the DEL character is typed; quit, generated by the FS character; hangup, caused by 
hanging up the phone; and terminate, generated by the kill command. When one of these 
events occurs, the signal is sent to all processes which were started from the corresponding ter- 
minal; unless other arrangements have been made, the signal terminates the process. In the 
quit case, a core image file is written for debugging purposes. 

The routine which alters the default action is called signal. It has two arguments: the 
first specifies the signal, and the second specifies how to treat it. The first argument is just a 
number code, but the second is either the address of a function, or a somewhat strange code
that requests that the signal either be ignored, or that it be given the default action. The
include file signal.h gives names for the various arguments, and should always be included
when signals are used. Thus 

#include <signal.h>

...

signal(SIGINT, SIG_IGN);

causes interrupts to be ignored, while

signal(SIGINT, SIG_DFL);

restores the default action of process termination. In all cases, signal returns the previous 
value of the signal. The second argument to signal may instead be the name of a function 
(which has to be declared explicitly if the compiler hasn't seen it already). In this case, the 
named routine will be called when the signal occurs. Most commonly this facility is used to
allow the program to clean up unfinished business before terminating, for example to delete a
temporary file:

#include <signal.h>

main()
{
    int onintr();

    if (signal(SIGINT, SIG_IGN) != SIG_IGN)
        signal(SIGINT, onintr);

    /* Process ... */

    exit(0);
}

onintr()
{
    unlink(tempfile);
    exit(1);
}

Why the test and the double call to signal? Recall that signals like interrupt are sent to 
all processes started from a particular terminal. Accordingly, when a program is to be run non- 
interactively (started by &), the shell turns off interrupts for it so it won't be stopped by inter- 
rupts intended for foreground processes. If this program began by announcing that all inter- 
rupts were to be sent to the onintr routine regardless, that would undo the shell's effort to 
protect it when run in the background. 

The solution, shown above, is to test the state of interrupt handling, and to continue to 
ignore interrupts if they are already being ignored. The code as written depends on the fact 
that signal returns the previous state of a particular signal. If signals were already being 
ignored, the process should continue to ignore them; otherwise, they should be caught. 
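
The test-then-set idiom is easily packaged as a routine of its own. In this sketch the names catch_unless_ignored, note_signal, and seen are invented, and the handler is declared with the int argument that present-day compilers expect.

```c
#include <signal.h>

static volatile sig_atomic_t seen;	/* last signal number noted */

/* A sample handler: just remember which signal arrived. */
void note_signal(int sig)
{
	seen = sig;
}

/* Install handler for sig unless sig is already being ignored.
   Returns 1 if the handler was installed, 0 if the signal stays ignored. */
int catch_unless_ignored(int sig, void (*handler)(int))
{
	if (signal(sig, SIG_IGN) == SIG_IGN)
		return 0;	/* e.g. started with &: leave it ignored */
	signal(sig, handler);
	return 1;
}
```

There is a brief interval between the two calls to signal during which the signal is ignored even in a foreground program; for cleanup handlers of this kind that is usually harmless.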

A more sophisticated program may wish to intercept an interrupt and interpret it as a 
request to stop what it is doing and return to its own command-processing loop. Think of a 
text editor: interrupting a long printout should not cause it to terminate and lose the work 
already done. The outline of the code for this case is probably best written like this: 

#include <signal.h>
#include <setjmp.h>
jmp_buf sjbuf;

main()
{
    int (*istat)(), onintr();

    istat = signal(SIGINT, SIG_IGN);    /* save original status */
    setjmp(sjbuf);    /* save current stack position */
    if (istat != SIG_IGN)
        signal(SIGINT, onintr);

    /* main processing loop */
}

onintr()
{
    printf("\nInterrupt\n");
    longjmp(sjbuf);    /* return to saved state */
}

The include file setjmp.h declares the type jmp_buf, an object in which the state can be
saved. sjbuf is such an object; it is an array of some sort. The setjmp routine then saves
the state of things. When an interrupt occurs, a call is forced to the onintr routine, which
can print a message, set flags, or whatever. longjmp takes as argument an object stored into
by setjmp, and restores control to the location after the call to setjmp, so control (and the
stack level) will pop back to the place in the main routine where the signal is set up and the 
main loop entered. Notice, by the way, that the signal gets set again after an interrupt occurs. 
This is necessary; most signals are automatically reset to their default action when they occur. 

Some programs that want to detect signals simply can't be stopped at an arbitrary point, for 
example in the middle of updating a linked list. If the routine called on occurrence of a signal 
sets a flag and then returns instead of calling exit or longjmp, execution will continue at the
exact point it was interrupted. The interrupt flag can then be tested later. 
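
A minimal sketch of the flag technique follows, again with raise standing in for a real interrupt. The names are invented for illustration; the flag is given the type volatile sig_atomic_t, which is what current C requires of a variable shared with a signal handler.

```c
#include <signal.h>

static volatile sig_atomic_t intflag;	/* set by the handler, tested later */

static void note_intr(int sig)
{
	signal(sig, note_intr);	/* re-arm, for systems that reset the action */
	intflag = 1;
}

/* Perform up to `steps` updates, none of which may be abandoned halfway,
   looking at the flag only between steps. An interrupt is simulated in
   the middle of step number interrupt_at (use -1 for no interrupt).
   Returns the number of steps completed. */
int update_until_interrupted(int steps, int interrupt_at)
{
	int done = 0;

	intflag = 0;
	signal(SIGINT, note_intr);
	while (done < steps && !intflag) {
		if (done == interrupt_at)
			raise(SIGINT);	/* arrives in mid-step */
		done++;			/* the step still runs to completion */
	}
	return done;
}
```

The step that was under way when the interrupt arrived is finished before the loop notices the flag, which is the whole point of the technique.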

There is one difficulty associated with this approach. Suppose the program is reading the 
terminal when the interrupt is sent. The specified routine is duly called; it sets its flag and 
returns. If it were really true, as we said above, that "execution resumes at the exact point it 
was interrupted," the program would continue reading the terminal until the user typed another
line. This behavior might well be confusing, since the user might not know that the program is 
reading; he presumably would prefer to have the signal take effect instantly. The method 
chosen to resolve this difficulty is to terminate the terminal read when execution resumes after 
the signal, returning an error code which indicates what happened. 

Thus programs which catch and resume execution after signals should be prepared for
"errors" which are caused by interrupted system calls. (The ones to watch out for are reads
from a terminal, wait, and pause.) A program whose onintr routine just sets intflag,
resets the interrupt signal, and returns, should usually include code like the following when it 
reads the standard input: 

if (getchar() == EOF)
    if (intflag)
        /* EOF caused by interrupt */
    else
        /* true end-of-file */

A final subtlety to keep in mind becomes important when signal-catching is combined with 
execution of other programs. Suppose a program catches interrupts, and also includes a method 
(like "!" in the editor) whereby other programs can be executed. Then the code should look 
something like this: 

if (fork() == 0)
    execl(...);
signal(SIGINT, SIG_IGN);    /* ignore interrupts */
wait(&status);              /* until the child is done */
signal(SIGINT, onintr);     /* restore interrupts */

Why is this? Again, it's not obvious but not really difficult. Suppose the program you call 
catches its own interrupts. If you interrupt the subprogram, it will get the signal and return to 
its main loop, and probably read your terminal. But the calling program will also pop out of its 
wait for the subprogram and read your terminal. Having two processes reading your terminal is 
very unfortunate, since the system figuratively flips a coin to decide who should get each line of 
input. A simple way out is to have the parent program ignore interrupts until the child is done. 
This reasoning is reflected in the standard I/O library function system:

#include <signal.h>

system(s)    /* run command string s */
char *s;
{
    int status, pid, w;
    register int (*istat)(), (*qstat)();

    if ((pid = fork()) == 0) {
        execl("/bin/sh", "sh", "-c", s, 0);
        _exit(127);
    }
    istat = signal(SIGINT, SIG_IGN);
    qstat = signal(SIGQUIT, SIG_IGN);
    while ((w = wait(&status)) != pid && w != -1)
        ;
    if (w == -1)
        status = -1;
    signal(SIGINT, istat);
    signal(SIGQUIT, qstat);
    return(status);
}

As an aside on declarations, the function signal obviously has a rather strange second 
argument. It is in fact a pointer to a function delivering an integer, and this is also the type of 
the signal routine itself. The two values SIG_IGN and SIG_DFL have the right type, but are
chosen so they coincide with no possible actual functions. For the enthusiast, here is how they 
are defined for the PDP-11; the definitions should be sufficiently ugly and nonportable to 
encourage use of the include file. 

#define SIG_DFL (int (*)())0
#define SIG_IGN (int (*)())1

Appendix - The Standard I/O Library

D. M. Ritchie 

Bell Laboratories 
Murray Hill, New Jersey 07974 

The standard I/O library was designed with the following goals in mind. 

1. It must be as efficient as possible, both in time and in space, so that there will be no hesita- 
tion in using it no matter how critical the application. 

2. It must be simple to use, and also free of the magic numbers and mysterious calls whose 
use mars the understandability and portability of many programs using older packages. 

3. The interface provided should be applicable on all machines, whether or not the programs 
which implement it are directly portable to other systems, or to machines other than the 
PDP-11 running a version of UNIX. 

1. General Usage 

Each program using the library must have the line 
#include <stdio.h>

which defines certain macros and variables. The routines are in the normal C library, so no 
special library argument is needed for loading. All names in the include file intended only for 
internal use begin with an underscore _ to reduce the possibility of collision with a user name. 
The names intended to be visible outside the package are 

stdin The name of the standard input file 

stdout The name of the standard output file 

stderr The name of the standard error file 

EOF is actually -1, and is the value returned by the read routines on end-of-file or error.

NULL is a notation for the null pointer, returned by pointer-valued functions to indicate an
error.

FILE expands to struct _iob and is a useful shorthand when declaring pointers to 
streams. 

BUFSIZ is a number (viz. 512) of the size suitable for an I/O buffer supplied by the user.
See setbuf, below. 

getc, getchar, putc, putchar, feof, ferror, fileno

are defined as macros. Their actions are described below; they are mentioned here 
to point out that it is not possible to redeclare them and that they are not actually 
functions; thus, for example, they may not have breakpoints set on them. 

The routines in this package offer the convenience of automatic buffer allocation and out- 
put flushing where appropriate. The names stdin, stdout, and stderr are in effect con- 
stants and may not be assigned to. 

2. Calls 

FILE *fopen(filename, type) char *filename, *type;

opens the file and, if needed, allocates a buffer for it. filename is a character string
specifying the name. type is a character string (not a single character). It may be "r",
"w", or "a" to indicate intent to read, write, or append. The value returned is a file 
pointer. If it is NULL the attempt to open failed. 

FILE *freopen(filename, type, ioptr) char *filename, *type; FILE *ioptr;

The stream named by ioptr is closed, if necessary, and then reopened as if by fopen. If
the attempt to open fails, NULL is returned, otherwise ioptr, which will now refer to the 
new file. Often the reopened stream is stdin or stdout. 

int getc (ioptr) FILE *ioptr; 

returns the next character from the stream named by ioptr, which is a pointer to a file 
such as returned by fopen, or the name stdin. The integer EOF is returned on end-of-file
or when an error occurs. The null character \0 is a legal character.

int fgetc(ioptr) FILE *ioptr;

acts like getc but is a genuine function, not a macro, so it can be pointed to, passed as an 
argument, etc. 

putc(c, ioptr) FILE *ioptr; 

putc writes the character c on the output stream named by ioptr, which is a value 
returned from fopen or perhaps stdout or stderr. The character is returned as value, 
but EOF is returned on error. 
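
As a small illustration of getc and putc working together, the routine below copies one stream to another a character at a time. The name copy_stream is ours. Note that the character is held in an int rather than a char, so that EOF can be distinguished from every legal character, including \0.

```c
#include <stdio.h>

/* Copy everything from in to out one character at a time.
   Returns the number of characters copied, or -1 on a write error. */
long copy_stream(FILE *in, FILE *out)
{
	int c;		/* int, not char, so EOF can be told apart from data */
	long n = 0;

	while ((c = getc(in)) != EOF) {
		if (putc(c, out) == EOF)
			return -1;
		n++;
	}
	return n;
}
```

A call such as copy_stream(stdin, stdout) gives a bare-bones version of cat.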

fputc(c, ioptr) FILE *ioptr; 

acts like putc but is a genuine function, not a macro. 

fclose(ioptr) FILE *ioptr;

The file corresponding to ioptr is closed after any buffers are emptied. A buffer allocated
by the I/O system is freed. fclose is automatic on normal termination of the program.

fflush(ioptr) FILE *ioptr;

Any buffered information on the (output) stream named by ioptr is written out. Output 
files are normally buffered if and only if they are not directed to the terminal; however, 
stderr always starts off unbuffered and remains so unless setbuf is used, or unless it is 
reopened. 

exit (errcode) ; 

terminates the process and returns its argument as status to the parent. This is a special 
version of the routine which calls fflush for each output file. To terminate without flushing,
use _exit.

feof (ioptr) FILE *ioptr; 

returns non-zero when end-of-file has occurred on the specified input stream. 

ferror(ioptr) FILE *ioptr;

returns non-zero when an error has occurred while reading or writing the named stream. 
The error indication lasts until the file has been closed. 

getchar ( ) ; 

is identical to getc (stdin) . 

putchar(c) ; 

is identical to putc(c, stdout).

char *fgets(s, n, ioptr) char *s; FILE * ioptr; 

reads up to n-1 characters from the stream ioptr into the character pointer s. The read 
terminates with a newline character. The newline character is placed in the buffer followed 
by a null character. fgets returns the first argument, or NULL if error or end-of-file
occurred. 

fputs(s, ioptr) char *s; FILE *ioptr; 

writes the null-terminated string (character array) s on the stream ioptr. No newline is 
appended. No value is returned. 

ungetc(c, ioptr) FILE *ioptr;

The argument character c is pushed back on the input stream named by ioptr. Only one
character may be pushed back. 

printf (format, a1 , . . . ) char *format; 

fprintf (ioptr, format, a1 , ...) FILE *ioptr; char *format; 

sprintf(s, format, a1 , ...)char *s, *format; 

printf writes on the standard output. fprintf writes on the named output stream.
sprintf puts characters in the character array (string) named by s. The specifications are
as described in section printf(3) of the UNIX Programmer's Manual.

scanf (format, a1 , . . . ) char *format; 

fscanf (ioptr, format, a1 , . ..) FILE * ioptr; char *format; 

sscanf(s, format, a1 , ...) char *s, *format; 

scanf reads from the standard input. fscanf reads from the named input stream.
sscanf reads from the character string supplied as s. scanf reads characters, interprets
them according to a format, and stores the results in its arguments. Each routine expects 
as arguments a control string format, and a set of arguments, each of which must be a 
pointer, indicating where the converted input should be stored. 

scanf returns as its value the number of successfully matched and assigned input items. 
This can be used to decide how many input items were found. On end of file, EOF is 
returned; note that this is different from 0, which means that the next input character does 
not match what was called for in the control string. 
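
The distinction between 0 and EOF is easy to try out with sscanf. The wrapper below (parse_pair is an invented name) asks for two integers and simply passes back whatever sscanf returns.

```c
#include <stdio.h>

/* Try to read two integers from s into *a and *b.
   Returns the value of sscanf: the number of items assigned, or EOF. */
int parse_pair(const char *s, int *a, int *b)
{
	return sscanf(s, "%d %d", a, b);
}
```

Given "3 4" it returns 2; given "3 x" it returns 1, since only the first item matched; given "x" it returns 0; and given an empty string it returns EOF, because the input ran out before anything could be matched.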

fread(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr;

reads nitems of data from file ioptr into the storage beginning at ptr. No advance notification that binary
I/O is being done is required; when, for portability reasons, it becomes required, it will be done
by adding an additional character to the mode-string on the fopen call.

fwrite(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr;

Like fread, but in the other direction.
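
A short sketch of the two calls used as a pair: an array of integers is written to a stream opened for both reading and writing, the stream is rewound, and the data is read back. The name round_trip_ints is made up, and the stream is assumed to have been opened for update (for instance by the tmpfile routine of current C libraries, which is not part of the library described here).

```c
#include <stdio.h>

/* Write n ints from in to fp, rewind, and read them back into out.
   Returns the number of items read back, or -1 if the write fell short.
   fp must be open for both writing and reading; n must not be negative. */
int round_trip_ints(FILE *fp, const int *in, int *out, int n)
{
	if (fwrite(in, sizeof(*in), (size_t)n, fp) != (size_t)n)
		return -1;
	rewind(fp);
	return (int)fread(out, sizeof(*out), (size_t)n, fp);
}
```

Both routines return the number of whole items transferred, so a short count is the sign of an error or of end of file.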

rewind (ioptr) FILE * ioptr; 

rewinds the stream named by ioptr. It is not very useful except on input, since a rewound 
output file is still open only for output. 

system (string) char * string; 

The string is executed by the shell as if typed at the terminal. 

getw(ioptr) FILE *ioptr;

returns the next word from the input stream named by ioptr. EOF is returned on end-of-file
or error, but since this is a perfectly good integer, feof and ferror should be used. A "word"
is 16 bits on the PDP-11.

putw(w, ioptr) FILE * ioptr; 

writes the integer w on the named output stream. 

setbuf (ioptr, buf) FILE * ioptr; char *buf; 

setbuf may be used after a stream has been opened but before I/O has started. If buf is 
NULL, the stream will be unbuffered. Otherwise the buffer supplied will be used. It must be a 
character array of sufficient size: 

char buf[BUFSIZ];

fileno(ioptr) FILE *ioptr;

returns the integer file descriptor associated with the file. 

fseek(ioptr, offset, ptrname) FILE *ioptr; long offset;

The location of the next byte in the stream named by ioptr is adjusted. offset is a long
integer. If ptrname is 0, the offset is measured from the beginning of the file; if ptrname is
1, the offset is measured from the current read or write pointer; if ptrname is 2, the offset is
measured from the end of the file. The routine accounts properly for any buffering. (When
this routine is used on non-UNIX systems, the offset must be a value returned from ftell and
the ptrname must be 0.)

long ftell (ioptr) FILE *ioptr; 

The byte offset, measured from the beginning of the file, associated with the named stream is 
returned. Any buffering is properly accounted for. (On non-UNIX systems the value of this 
call is useful only for handing to fseek, so as to position the file to the same place it was when
ftell was called.) 
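
The two routines combine naturally to find out how long a file is without disturbing the current position. In the sketch below (stream_size is our name) the numeric ptrname values are used as described above; current C libraries also give them the names SEEK_SET, SEEK_CUR, and SEEK_END.

```c
#include <stdio.h>

/* Return the size in bytes of the file behind fp, leaving the
   read/write position where it was. Returns -1 on failure. */
long stream_size(FILE *fp)
{
	long here, size;

	if ((here = ftell(fp)) == -1L)
		return -1;
	if (fseek(fp, 0L, 2) != 0)	/* ptrname 2: measure from the end */
		return -1;
	size = ftell(fp);
	if (fseek(fp, here, 0) != 0)	/* ptrname 0: measure from the beginning */
		return -1;
	return size;
}
```

This works for ordinary files; on a terminal or a pipe the seek fails and the routine returns -1.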

getpw(uid, buf) char *buf; 

The password file is searched for the given integer user ID. If an appropriate line is found, it is 
copied into the character array buf, and 0 is returned. If no line is found corresponding to the
user ID then 1 is returned. 

char *malloc (num) ; 

allocates num bytes. The pointer returned is sufficiently well aligned to be usable for any pur- 
pose. NULL is returned if no space is available. 

char *calloc (num, size); 

allocates space for num items each of size size. The space is guaranteed to be set to 0 and the 
pointer is sufficiently well aligned to be usable for any purpose. NULL is returned if no space is 
available.

cfree(ptr) char *ptr; 

Space is returned to the pool used by calloc. Disorder can be expected if the pointer was not 
obtained from calloc. 
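
As an illustration of the zero-fill guarantee, the sketch below allocates a table of counters with calloc and checks that they all start at zero. The function names are invented. One assumption to flag: cfree is not generally available in current C libraries, so space obtained this way is nowadays returned with free.

```c
#include <stdlib.h>

/* Allocate a table of n counters, all starting at zero.
   Returns NULL if n is not positive or no space is available. */
int *new_counters(int n)
{
	if (n <= 0)
		return NULL;
	return (int *)calloc((size_t)n, sizeof(int));
}

/* Return 1 if every one of the n counters is zero, 0 otherwise. */
int all_zero(const int *t, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (t[i] != 0)
			return 0;
	return 1;
}
```

Had the table come from malloc instead, its initial contents would be unspecified and the counters would have to be cleared by hand.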

The following are macros whose definitions may be obtained by including <ctype.h>. 

isalpha (c) returns non-zero if the argument is alphabetic. 

isupper (c) returns non-zero if the argument is upper-case alphabetic.

islower (c) returns non-zero if the argument is lower-case alphabetic. 

isdigit (c) returns non-zero if the argument is a digit. 

isspace(c) returns non-zero if the argument is a spacing character: tab, newline, carriage 
return, vertical tab, form feed, space. 

ispunct(c) returns non-zero if the argument is any punctuation character, i.e., not a space, 
letter, digit or control character. 

isalnum(c) returns non-zero if the argument is a letter or a digit. 

isprint(c) returns non-zero if the argument is printable — a letter, digit, or punctuation 
character. 

iscntrl (c) returns non-zero if the argument is a control character.

isascii (c) returns non-zero if the argument is an ascii character, i.e., less than octal 0200.

toupper (c) returns the upper-case character corresponding to the lower-case letter c.

tolower (c) returns the lower-case character corresponding to the upper-case letter c.

ABSTRACT

Computer program input generally has some structure; in fact, every computer
program that does input can be thought of as defining an "input language"
which it accepts. An input language may be as complex as a programming
language, or as simple as a sequence of numbers. Unfortunately, usual input facilities
are limited, difficult to use, and often are lax about checking their inputs for
validity.

Yacc provides a general tool for describing the input to a computer program. 
The Yacc user specifies the structures of his input, together with code to be
invoked as each such structure is recognized. Yacc turns such a specification into
a subroutine that handles the input process; frequently, it is convenient and
appropriate to have most of the flow of control in the user's application handled by
this subroutine.

The input subroutine produced by Yacc calls a user-supplied routine to 
return the next basic input item. Thus, the user can specify his input in terms of 
individual input characters, or in terms of higher level constructs such as names 
and numbers. The user-supplied routine may also handle idiomatic features such 
as comment and continuation conventions, which typically defy easy grammatical 
specification. 

Yacc is written in portable C. The class of specifications accepted is a very
general one: LALR(1) grammars with disambiguating rules.

In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc has also
been used for less conventional languages, including a phototypesetter language,
several desk calculator languages, a document retrieval system, and a Fortran
debugging system.

0: Introduction

Yacc provides a general tool for imposing structure on the input to a computer program. The 
Yacc user prepares a specification of the input process; this includes rules describing the input 
structure, code to be invoked when these rules are recognized, and a low-level routine to do the
basic input. Yacc then generates a function to control the input process. This function, called a
parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic 
items (called tokens) from the input stream. These tokens are organized according to the input 
structure rules, called grammar rules ; when one of these rules has been recognized, then user code 
supplied for this rule, an action, is invoked; actions have the ability to return values and make use 
of the values of other actions. 

Yacc is written in a portable dialect of C [1] and the actions, and output subroutine, are in C as
well. Moreover, many of the syntactic conventions of Yacc follow C. 

The heart of the input specification is a collection of grammar rules. Each rule describes an 
allowable structure and gives it a name. For example, one grammar rule might be 

date : month_name day ',' year ;

Here, date, month_name, day, and year represent structures of interest in the input process;
presumably, month_name, day, and year are defined elsewhere. The comma is enclosed in single
quotes; this implies that the comma is to appear literally in the input. The colon and semicolon
merely serve as punctuation in the rule, and have no significance in controlling the input. Thus, 
with proper definitions, the input 

July 4, 1776 

might be matched by the above rule. 

An important part of the input process is carried out by the lexical analyzer. This user rou- 
tine reads the input stream, recognizing the lower level structures, and communicates these tokens 
to the parser. For historical reasons, a structure recognized by the lexical analyzer is called a termi- 
nal symbol, while the structure recognized by the parser is called a nonterminal symbol. To avoid 
confusion, terminal symbols will usually be referred to as tokens. 

There is considerable leeway in deciding whether to recognize structures using the lexical 
analyzer or grammar rules. For example, the rules 

month_name : 'J' 'a' 'n' ;
month_name : 'F' 'e' 'b' ;
    . . .
month_name : 'D' 'e' 'c' ;

might be used in the above example. The lexical analyzer would only need to recognize individual
letters, and month_name would be a nonterminal symbol. Such low-level rules tend to waste time
and space, and may complicate the specification beyond Yacc's ability to deal with it. Usually, the
lexical analyzer would recognize the month names, and return an indication that a month_name
was seen; in this case, month_name would be a token.

Literal characters such as "," must also be passed through the lexical analyzer, and are also
considered tokens.
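
As a rough illustration of the second approach, the following C sketch shows a helper that a hand-written lexical analyzer might use to recognize month names as single tokens. The names classify_word, MONTH_NAME, and OTHER_WORD, and the token numbers, are assumptions made for this example only; they are not part of Yacc.

```c
#include <string.h>

/* Hypothetical token numbers, chosen for this sketch only. */
enum { MONTH_NAME = 257, OTHER_WORD = 258 };

/* Classify one word. On a match, store the month number (1-12) in *value
   and return MONTH_NAME; otherwise return OTHER_WORD. */
int classify_word(const char *word, int *value)
{
    static const char *months[] = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };
    int i;

    for (i = 0; i < 12; i++) {
        if (strcmp(word, months[i]) == 0) {
            *value = i + 1;
            return MONTH_NAME;
        }
    }
    return OTHER_WORD;
}
```

With a helper of this kind the grammar sees one month_name token rather than a sequence of letters, which keeps the grammar rules small.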

Specification files are very flexible. It is relatively easy to add to the above example the rule

date : month '/' day '/' year ;

allowing

7 / 4 / 1776

as a synonym for

July 4, 1776 

In most cases, this new rule could be "slipped in" to a working system with minimal effort, and lit- 
tle danger of disrupting existing input. 

The input being read may not conform to the specifications. These input errors are detected 
as early as is theoretically possible with a left-to-right scan; thus, not only is the chance of reading 
and computing with bad input data substantially reduced, but the bad data can usually be quickly 
found. Error handling, provided as part of the input specifications, permits the reentry of bad data,
or the continuation of the input process after skipping over the bad data. 

In some cases, Yacc fails to produce a parser when given a set of specifications. For example, 
the specifications may be self contradictory, or they may require a more powerful recognition 
mechanism than that available to Yacc. The former cases represent design errors; the latter cases 
can often be corrected by making the lexical analyzer more powerful, or by rewriting some of the 
grammar rules. While Yacc cannot handle all possible specifications, its power compares favorably 
with similar systems; moreover, the constructions which are difficult for Yacc to handle are also
frequently difficult for human beings to handle. Some users have reported that the discipline of
formulating valid Yacc specifications for their input revealed errors of conception or design early in
the program development. 

The theory underlying Yacc has been described elsewhere. [2, 3, 4] Yacc has been extensively
used in numerous practical applications, including lint, [5] the Portable C Compiler, [6] and a system for
typesetting mathematics. [7]

The next several sections describe the basic process of preparing a Yacc specification; Section 
1 describes the preparation of grammar rules, Section 2 the preparation of the user supplied actions 
associated with these rules, and Section 3 the preparation of lexical analyzers. Section 4 describes 
the operation of the parser. Section 5 discusses various reasons why Yacc may be unable to pro- 
duce a parser from a specification, and what to do about it. Section 6 describes a simple mechan- 
ism for handling operator precedences in arithmetic expressions. Section 7 discusses error detection 
and recovery. Section 8 discusses the operating environment and special features of the parsers 
Yacc produces. Section 9 gives some suggestions which should improve the style and efficiency of 
the specifications. Section 10 discusses some advanced topics, and Section 11 gives acknowledge- 
ments. Appendix A has a brief example, and Appendix B gives a summary of the Yacc input syn- 
tax. Appendix C gives an example using some of the more advanced features of Yacc, and, finally, 
Appendix D describes mechanisms and syntax no longer actively supported, but provided for
historical continuity with older versions of Yacc.

1: Basic Specifications 

Names refer to either tokens or nonterminal symbols. Yacc requires token names to be
declared as such. In addition, for reasons discussed in Section 3, it is often desirable to include the
lexical analyzer as part of the specification file; it may be useful to include other programs as well.
Thus, every specification file consists of three sections: the declarations, (grammar) rules, and
programs. The sections are separated by double percent "%%" marks. (The percent is generally
used in Yacc specifications as an escape character.)



In other words, a full specification file looks like 

declarations
%%
rules
%%
programs

The declaration section may be empty. Moreover, if the programs section is omitted, the 
second %% mark may be omitted also; thus, the smallest legal Yacc specification is 

%% 
rules 

Blanks, tabs, and newlines are ignored except that they may not appear in names or
multi-character reserved symbols. Comments may appear wherever a name is legal; they are enclosed in
/* . . . */, as in C and PL/I.

The rules section is made up of one or more grammar rules. A grammar rule has the form:

A : BODY ;

A represents a nonterminal name, and BODY represents a sequence of zero or more names and
literals. The colon and the semicolon are Yacc punctuation.

Names may be of arbitrary length, and may be made up of letters, dot ".", underscore "_",
and non-initial digits. Upper and lower case letters are distinct. The names used in the body of a
grammar rule may represent tokens or nonterminal symbols. 

A literal consists of a character enclosed in single quotes "'". As in C, the backslash "\" is an
escape character within literals, and all the C escapes are recognized. Thus

'\n'    newline
'\r'    return
'\''    single quote "'"
'\\'    backslash "\"
'\t'    tab
'\b'    backspace
'\f'    form feed
'\xxx'  "xxx" in octal

For a number of technical reasons, the nul character ('\0' or 0) should never be used in grammar
rules.
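
The escape conventions listed above can be sketched as a small C routine that maps the text between the quotes of a literal to a character value. This is only an illustration of the table, not the code Yacc itself uses; the function names simple_escape and literal_value are invented for the example, and the input is assumed to be the text between the quotes with the quotes already removed.

```c
/* Map a one-letter escape to its character value, or -1 if unknown. */
static int simple_escape(char c)
{
    switch (c) {
    case 'n':  return '\n';
    case 'r':  return '\r';
    case '\'': return '\'';
    case '\\': return '\\';
    case 't':  return '\t';
    case 'b':  return '\b';
    case 'f':  return '\f';
    default:   return -1;
    }
}

/* Decode the text between the quotes of a literal, e.g. "a", "\\n", "\\101".
   Returns the character value, or -1 if the text is not recognized. */
int literal_value(const char *s)
{
    int v = 0, n = 0;

    if (s[0] == '\0')
        return -1;
    if (s[0] != '\\')                       /* ordinary single character */
        return s[1] == '\0' ? (unsigned char)s[0] : -1;
    if (s[1] != '\0' && s[2] == '\0' && simple_escape(s[1]) >= 0)
        return simple_escape(s[1]);
    /* otherwise expect one to three octal digits after the backslash */
    for (s++; *s >= '0' && *s <= '7' && n < 3; s++, n++)
        v = v * 8 + (*s - '0');
    return (n > 0 && *s == '\0') ? v : -1;
}
```

For example, literal_value("\\101") yields 65, the value of octal 101.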

If there are several grammar rules with the same left hand side, the vertical bar "|" can be
used to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can be
dropped before a vertical bar. Thus the grammar rules

A : B C D ;
A : E F ;
A : G ;

can be given to Yacc as

A : B C D
  | E F
  | G
  ;

It is not necessary that all grammar rules with the same left side appear together in the grammar
rules section, although it makes the input much more readable, and easier to change.

If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:

empty : ;

Names representing tokens must be declared; this is most simply done by writing 

%token name1 name2 . . .

in the declarations section. (See Sections 3, 5, and 6 for much more discussion.) Every name not
defined in the declarations section is assumed to represent a nonterminal symbol. Every nontermi- 
nal symbol must appear on the left side of at least one rule. 

Of all the nonterminal symbols, one, called the start symbol, has particular importance. The 
parser is designed to recognize the start symbol; thus, this symbol represents the largest, most
general structure described by the grammar rules. By default, the start symbol is taken to be the left
hand side of the first grammar rule in the rules section. It is possible, and in fact desirable, to
declare the start symbol explicitly in the declarations section using the %start keyword:

%start symbol 

The end of the input to the parser is signaled by a special token, called the endmarker. If the 
tokens up to, but not including, the endmarker form a structure which matches the start symbol, the
parser function returns to its caller after the endmarker is seen; it accepts the input. If the end- 
marker is seen in any other context, it is an error. 

It is the job of the user-supplied lexical analyzer to return the endmarker when appropriate; 
see Section 3, below. Usually the endmarker represents some reasonably obvious I/O status, such
as "end-of-file" or "end-of-record".

2: Actions 

With each grammar rule, the user may associate actions to be performed each time the rule is 
recognized in the input process. These actions may return values, and may obtain the values 
returned by previous actions. Moreover, the lexical analyzer can return values for tokens, if 
desired. 

An action is an arbitrary C statement, and as such can do input and output, call subprograms, 
and alter external vectors and variables. An action is specified by one or more statements, enclosed 
in curly braces "{" and "}". For example, 

A : '(' B ')'
        { hello( 1, "abc" ); }

and

XXX : YYY ZZZ
        { printf("a message\n");
          flag = 25; }

are grammar rules with actions.

To facilitate easy communication between the actions and the parser, the action statements are
altered slightly. The dollar sign "$" is used as a signal to Yacc in this context.

To return a value, the action normally sets the pseudo-variable "$$" to some value. For
example, an action that does nothing but return the value 1 is

{ $$ = 1; } 

To obtain the values returned by previous actions and the lexical analyzer, the action may use 
the pseudo-variables $1, $2, . . ., which refer to the values returned by the components of the right 
side of a rule, reading from left to right. Thus, if the rule is

A : B C D ;

for example, then $2 has the value returned by C, and $3 the value returned by D.

As a more concrete example, consider the rule

expr : '(' expr ')' ;

The value returned by this rule is usually the value of the expr in parentheses. This can be indi- 
cated by 

expr : '(' expr ')' { $$ = $2 ; }

By default, the value of a rule is the value of the first element in it ($1). Thus, grammar rules
of the form

A : B ;

frequently need not have an explicit action. 

In the examples above, all the actions came at the end of their rules. Sometimes, it is desir- 
able to get control before a rule is fully parsed. Yacc permits an action to be written in the middle 
of a rule as well as at the end. This rule is assumed to return a value, accessible through the usual 
mechanism by the actions to the right of it. In turn, it may access the values returned by the sym- 
bols to its left. Thus, in the rule 

A : B
        { $$ = 1; }
    C
        { x = $2; y = $3; }
    ;

the effect is to set x to 1, and y to the value returned by C. 

Actions that do not terminate a rule are actually handled by Yacc by manufacturing a new
nonterminal symbol name, and a new rule matching this name to the empty string. The interior 
action is the action triggered off by recognizing this added rule. Yacc actually treats the above 
example as if it had been written: 

$ACT : /* empty */
        { $$ = 1; }
    ;

A : B $ACT C
        { x = $2; y = $3; }
    ;

In many applications, output is not done directly by the actions; rather, a data structure, such
as a parse tree, is constructed in memory, and transformations are applied to it before output is
generated. Parse trees are particularly easy to construct, given routines to build and maintain the
tree structure desired. For example, suppose there is a C function node, written so that the call

node( L, n1, n2 )

creates a node with label L, and descendants n1 and n2, and returns the index of the newly created
node. Then a parse tree can be built by supplying actions such as:

expr : expr '+' expr
        { $$ = node( '+', $1, $3 ); }

in the specification.
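
One possible shape for such a node function is sketched below. The text only fixes the call node( L, n1, n2 ) and the fact that an index is returned; the array-based storage, the struct layout, the MAXNODES limit, and the use of -1 for "no descendant" are all assumptions made for this illustration.

```c
#include <assert.h>

#define MAXNODES 100           /* arbitrary limit chosen for this sketch */

struct tnode {
    int label;                 /* e.g. an operator character such as '+' */
    int left, right;           /* indices of descendants, -1 for none */
};

static struct tnode tree[MAXNODES];
static int nnodes = 0;         /* number of nodes created so far */

/* Create a node with label L and descendants n1 and n2;
   return the index of the newly created node. */
int node(int L, int n1, int n2)
{
    assert(nnodes < MAXNODES); /* a real program would report the overflow */
    tree[nnodes].label = L;
    tree[nnodes].left = n1;
    tree[nnodes].right = n2;
    return nnodes++;
}
```

Because the returned index is an integer, it can be passed between actions through $$ and $1, $2, . . . without any special declarations.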

The user may define other variables to be used by the actions. Declarations and definitions
can appear in the declarations section, enclosed in the marks "%{" and "%}". These declarations
and definitions have global scope, so they are known to the action statements and the lexical
analyzer. For example,

%{ int variable = 0; %} 

could be placed in the declarations section, making variable accessible to all of the actions. The 
Yacc parser uses only names beginning in "yy"; the user should avoid such names. 

In these examples, all the values are integers; a discussion of values of other types will be
found in Section 10. 

3: Lexical Analysis 

The user must supply a lexical analyzer to read the input stream and communicate tokens 
(with values, if desired) to the parser. The lexical analyzer is an integer-valued function called
yylex. The function returns an integer, the token number, representing the kind of token read. If 
there is a value associated with that token, it should be assigned to the external variable yylval. 

The parser and the lexical analyzer must agree on these token numbers in order for commun- 
ication between them to take place. The numbers may be chosen by Yacc, or chosen by the user. 
In either case, the "#define" mechanism of C is used to allow the lexical analyzer to return these
numbers symbolically. For example, suppose that the token name DIGIT has been defined in the 
declarations section of the Yacc specification file. The relevant portion of the lexical analyzer 
might look like: 

yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
        . . .
        case '0':
        case '1':
        . . .
        case '9':
                yylval = c-'0';
                return( DIGIT );
        . . .
        }
        . . .

The intent is to return a token number of DIGIT, and a value equal to the numerical value of 
the digit. Provided that the lexical analyzer code is placed in the programs section of the 
specification file, the identifier DIGIT will be defined as the token number associated with the 
token DIGIT. 

This mechanism leads to clear, easily modified lexical analyzers; the only pitfall is the need to 
avoid using any token names in the grammar that are reserved or significant in C or the parser; for
example, the use of token names if or while will almost certainly cause severe difficulties when the
lexical analyzer is compiled. The token name error is reserved for error handling, and should not
be used naively (see Section 7).

As mentioned above, the token numbers may be chosen by Yacc or by the user. In the
default situation, the numbers are chosen by Yacc. The default token number for a literal character
is the numerical value of the character in the local character set. Other names are assigned token
numbers starting at 257.

To assign a token number to a token (including literals), the first appearance of the token
name or literal in the declarations section can be immediately followed by a nonnegative integer.
This integer is taken to be the token number of the name or literal. Names and literals not defined 
by this mechanism retain their default definition. It is important that all token numbers be distinct. 

For historical reasons, the endmarker must have token number 0 or negative. This token 
number cannot be redefined by the user; thus, all lexical analyzers should be prepared to return 0 
or negative as a token number upon reaching the end of their input. 
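
The conventions of this section (a token number as the return value, the token's value in yylval, literal characters standing for themselves, and 0 at the end of input) can be pulled together in a small self-contained sketch. Several things here are assumptions made so that the example stands alone: DIGIT is defined by hand as 257 rather than supplied by Yacc, yylval is defined in this file rather than by the generated parser, and the input comes from a string set with the invented helper set_input rather than from getchar.

```c
#include <ctype.h>

#define DIGIT 257                 /* assumed; Yacc would normally supply this */

int yylval;                       /* value of the token just read */
static const char *input = "";    /* hypothetical input source for this sketch */

/* Point the lexical analyzer at a new input string. */
void set_input(const char *s)
{
    input = s;
}

int yylex(void)
{
    int c;

    while (*input == ' ' || *input == '\t')   /* skip blanks and tabs */
        input++;
    if (*input == '\0')
        return 0;                 /* endmarker: token number 0 */
    c = (unsigned char)*input++;
    if (isdigit(c)) {
        yylval = c - '0';         /* numerical value of the digit */
        return DIGIT;
    }
    return c;                     /* literal: the character is its own token number */
}
```

Note that once the end of the string is reached, every further call keeps returning 0, so the parser can safely ask for the endmarker more than once.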

A very useful tool for constructing lexical analyzers is the Lex program developed by Mike 
Lesk. [8] These lexical analyzers are designed to work in close harmony with Yacc parsers. The
specifications for these lexical analyzers use regular expressions instead of grammar rules. Lex can 
be easily used to produce quite complicated lexical analyzers, but there remain some languages 
(such as FORTRAN) which do not fit any theoretical framework, and whose lexical analyzers must 
be crafted by hand. 

4: How the Parser Works 

Yacc turns the specification file into a C program, which parses the input according to the 
specification given. The algorithm used to go from the specification to the parser is complex, and 
will not be discussed here (see the references for more information). The parser itself, however, is
relatively simple, and understanding how it works, while not strictly necessary, will nevertheless 
make treatment of error recovery and ambiguities much more comprehensible. 

The parser produced by Yacc consists of a finite state machine with a stack. The parser is 
also capable of reading and remembering the next input token (called the lookahead token). The 
current state is always the one on the top of the stack. The states of the finite state machine are 
given small integer labels; initially, the machine is in state 0, the stack contains only state 0, and no 
lookahead token has been read. 

The machine has only four actions available to it, called shift, reduce, accept, and error. A 
move of the parser is done as follows: 

1. Based on its current state, the parser decides whether it needs a lookahead token to decide
what action should be done; if it needs one, and does not have one, it calls yylex to obtain
the next token.

2. Using the current state, and the lookahead token if needed, the parser decides on its next
action, and carries it out. This may result in states being pushed onto the stack, or popped
off of the stack, and in the lookahead token being processed or left alone.

The shift action is the most common action the parser takes. Whenever a shift action is
taken, there is always a lookahead token. For example, in state 56 there may be an action:

IF shift 34

which says, in state 56, if the lookahead token is IF, the current state (56) is pushed down on the
stack, and state 34 becomes the current state (on the top of the stack). The lookahead token is
cleared.

The reduce action keeps the stack from growing without bounds. Reduce actions are
appropriate when the parser has seen the right hand side of a grammar rule, and is prepared to 
announce that it has seen an instance of the rule, replacing the right hand side by the left hand 
side. It may be necessary to consult the lookahead token to decide whether to reduce, but usually 
it is not; in fact, the default action (represented by a ".") is often a reduce action. 

Reduce actions are associated with individual grammar rules. Grammar rules are also given 
small integer numbers, leading to some confusion. The action 

reduce 18 

refers to grammar rule 18, while the action

IF shift 34

refers to state 34.

Suppose the rule being reduced is

A : x y z ;

The reduce action depends on the left hand symbol (A in this case), and the number of symbols on
the right hand side (three in this case). To reduce, first pop off the top three states from the stack.
(In general, the number of states popped equals the number of symbols on the right side of the
rule.) In effect, these states were the ones put on the stack while recognizing x, y, and z, and no
longer serve any useful purpose. After popping these states, a state is uncovered which was the
state the parser was in before beginning to process the rule. Using this uncovered state, and the
symbol on the left side of the rule, perform what is in effect a shift of A. A new state is obtained,
pushed onto the stack, and parsing continues. There are significant differences between the
processing of the left hand symbol and an ordinary shift of a token, however, so this action is called a
goto action. In particular, the lookahead token is cleared by a shift, and is not affected by a goto.
In any case, the uncovered state contains an entry such as: 

A goto 20 

causing state 20 to be pushed onto the stack, and become the current state. 

In effect, the reduce action "turns back the clock" in the parse, popping the states off the 
stack to go back to the state where the right hand side of the rule was first seen. The parser then 
behaves as if it had seen the left side at that time. If the right hand side of the rule is empty, no 
states are popped off of the stack: the uncovered state is in fact the current state.

The reduce action is also important in the treatment of user-supplied actions and values. 
When a rule is reduced, the code supplied with the rule is executed before the stack is adjusted. In 
addition to the stack holding the states, another stack, running in parallel with it, holds the values
returned from the lexical analyzer and the actions. When a shift takes place, the external variable 
yylval is copied onto the value stack. After the return from the user code, the reduction is carried 
out. When the goto action is done, the external variable yyval is copied onto the value stack. The 
pseudo-variables $1, $2, etc., refer to the value stack. 
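
The two parallel stacks can be pictured with the following C sketch. It is a simplified model written for this explanation, not the code Yacc generates: the names states, values, push, dollar, and pop_rhs, and the depth limit, are all invented here. A single push routine serves both for a shift (where the value pushed would be yylval) and for a goto (where it would be yyval).

```c
#include <assert.h>

#define MAXDEPTH 150           /* arbitrary limit chosen for this sketch */

static int states[MAXDEPTH];   /* state stack */
static int values[MAXDEPTH];   /* value stack, running in parallel */
static int top;                /* index of the current (topmost) state */

/* Start with only state 0 on the stack. */
void init_stacks(void)
{
    top = 0;
    states[0] = 0;
    values[0] = 0;
}

/* Used both for a shift (value is yylval) and for a goto (value is yyval). */
void push(int state, int value)
{
    assert(top + 1 < MAXDEPTH);
    top++;
    states[top] = state;
    values[top] = value;
}

/* Value of $i while reducing a rule with rhs_len right-hand-side symbols. */
int dollar(int rhs_len, int i)
{
    assert(i >= 1 && i <= rhs_len && rhs_len <= top);
    return values[top - rhs_len + i];
}

/* Pop one state per right-hand-side symbol; return the uncovered state. */
int pop_rhs(int rhs_len)
{
    assert(rhs_len <= top);
    top -= rhs_len;
    return states[top];
}
```

In this model a reduction by a rule with two right-hand-side symbols reads $1 and $2 through dollar(2, 1) and dollar(2, 2), calls pop_rhs(2), and then pushes the goto state found in the uncovered state together with the action's result.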

The other two parser actions are conceptually much simpler. The accept action indicates that
the entire input has been seen and that it matches the specification. This action appears only when
the lookahead token is the endmarker, and indicates that the parser has successfully done its job.
The error action, on the other hand, represents a place where the parser can no longer continue
parsing according to the specification. The input tokens it has seen, together with the lookahead
token, cannot be followed by anything that would result in a legal input. The parser reports an 
error, and attempts to recover the situation and resume parsing: the error recovery (as opposed to 
the detection of error) will be covered in Section 7. 

It is time for an example! Consider the specification 

%token DING DONG DELL
%%
rhyme : sound place
      ;
sound : DING DONG
      ;
place : DELL
      ;

When Yacc is invoked with the -v option, a file called y.output is produced, with a human- 
readable description of the parser. The y.output file corresponding to the above grammar (with 
some statistics stripped off the end) is: 






state 0
        $accept : _rhyme $end

        DING shift 3
        . error

        rhyme goto 1
        sound goto 2

state 1
        $accept : rhyme_$end

        $end accept
        . error

state 2
        rhyme : sound_place

        DELL shift 5
        . error

        place goto 4

state 3
        sound : DING_DONG

        DONG shift 6
        . error

state 4
        rhyme : sound place_ (1)

        . reduce 1

state 5
        place : DELL_ (3)

        . reduce 3

state 6
        sound : DING DONG_ (2)

        . reduce 2

Notice that, in addition to the actions for each state, there is a description of the parsing rules being
processed in each state. The _ character is used to indicate what has been seen, and what is yet to
come, in each rule. Suppose the input is

DING DONG DELL

It is instructive to follow the steps of the parser while processing this input.

Initially, the current state is state 0. The parser needs to refer to the input in order to decide
between the actions available in state 0, so the first token, DING, is read, becoming the lookahead
token. The action in state 0 on DING is "shift 3", so state 3 is pushed onto the stack, and the
lookahead token is cleared. State 3 becomes the current state. The next token, DONG, is read,
becoming the lookahead token. The action in state 3 on the token DONG is "shift 6", so state 6 is
pushed onto the stack, and the lookahead is cleared. The stack now contains 0, 3, and 6. In state
6, without even consulting the lookahead, the parser reduces by rule 2.

sound : DING DONG 

This rule has two symbols on the right hand side, so two states, 6 and 3, are popped off of the 
stack, uncovering state 0. Consulting the description of state 0, looking for a goto on sound, 

sound goto 2 

is obtained; thus state 2 is pushed onto the stack, becoming the current state. 

In state 2, the next token, DELL, must be read. The action is "shift 5", so state 5 is pushed 
onto the stack, which now has 0, 2, and 5 on it, and the lookahead token is cleared. In state 5, the 
only action is to reduce by rule 3. This has one symbol on the right hand side, so one state, 5, is 
popped off, and state 2 is uncovered. The goto in state 2 on place, the left side of rule 3, is state 4.
Now, the stack contains 0, 2, and 4. In state 4, the only action is to reduce by rule 1. There are
two symbols on the right, so the top two states are popped off, uncovering state 0 again. In state 0,
there is a goto on rhyme causing the parser to enter state 1. In state 1, the input is read; the
endmarker is obtained, indicated by "$end" in the y.output file. The action in state 1 when the
endmarker is seen is to accept, successfully ending the parse.

The reader is urged to consider how the parser works when confronted with such incorrect 
strings as DING DONG DONG, DING DONG, DING DONG DELL DELL, etc. A few minutes 
spent with this and other simple examples will probably be repaid when problems arise in more
complicated contexts. 
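
For readers who would like to experiment, the seven states shown in the y.output listing can be transcribed by hand into a small C program. The sketch below is such a transcription, written for this explanation; it is not the parser Yacc would generate. The token numbers, the function names go_to, shift_to, and parse, and the representation of the input as an array of token numbers ending in 0 are assumptions of the example. Values and user actions are left out; only the state stack is modelled.

```c
#include <assert.h>

enum { END = 0, DING = 257, DONG = 258, DELL = 259 };   /* token numbers */
enum { RHYME, SOUND, PLACE };                           /* nonterminals */

/* Goto entries: state entered after reducing to nonterminal nt in state s. */
static int go_to(int s, int nt)
{
    if (s == 0 && nt == RHYME) return 1;
    if (s == 0 && nt == SOUND) return 2;
    if (s == 2 && nt == PLACE) return 4;
    return -1;
}

/* Shift entries: state entered on token t in state s, or -1 for "error". */
static int shift_to(int s, int t)
{
    if (s == 0 && t == DING) return 3;
    if (s == 2 && t == DELL) return 5;
    if (s == 3 && t == DONG) return 6;
    return -1;
}

/* Returns 1 if the tokens (terminated by END) are accepted, 0 on error. */
int parse(const int *tokens)
{
    int stack[8], top = 0;
    int have_la = 0, la = END;          /* lookahead token, if one was read */

    stack[0] = 0;
    for (;;) {
        int s = stack[top];
        int len, lhs, next;

        if (s >= 4) {                   /* states 4, 5, 6: default reduce */
            if (s == 4)      { len = 2; lhs = RHYME; }   /* rule 1 */
            else if (s == 6) { len = 2; lhs = SOUND; }   /* rule 2 */
            else             { len = 1; lhs = PLACE; }   /* rule 3 */
            top -= len;                 /* pop one state per symbol */
            next = go_to(stack[top], lhs);
            assert(next >= 0);
            stack[++top] = next;        /* goto: lookahead is not affected */
            continue;
        }
        if (!have_la) {                 /* a lookahead token is needed */
            la = *tokens;
            if (la != END)
                tokens++;
            have_la = 1;
        }
        if (s == 1)
            return la == END;           /* accept on $end, otherwise error */
        next = shift_to(s, la);
        if (next < 0)
            return 0;                   /* error */
        stack[++top] = next;
        have_la = 0;                    /* a shift clears the lookahead */
    }
}
```

Calling parse on the sequences DING DONG DONG, DING DONG, and DING DONG DELL DELL shows where each incorrect string is rejected: the first two fail in state 2, the last in state 1.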

5: Ambiguity and Conflicts 

A set of grammar rules is ambiguous if there is some input string that can be structured in
two or more different ways. For example, the grammar rule

expr : expr '-' expr

is a natural way of expressing the fact that one way of forming an arithmetic expression is to put
two other expressions together with a minus sign between them. Unfortunately, this grammar rule
does not completely specify the way that all complex inputs should be structured. For example, if
the input is

expr - expr - expr

the rule allows this input to be structured as either

( expr - expr ) - expr

or as

expr - ( expr - expr )

(The first is called left association, the second right association).

Yacc detects such ambiguities when it is attempting to build the parser. It is instructive to 
consider the problem that confronts the parser when it is given an input such as 

expr - expr - expr

When the parser has read the second expr, the input that it has seen:

expr - expr

matches the right side of the grammar rule above. The parser could reduce the input by applying
this rule; after applying the rule, the input is reduced to expr (the left side of the rule). The parser
would then read the final part of the input:

- expr

and again reduce. The effect of this is to take the left associative interpretation.

Alternatively, when the parser has seen

expr - expr

it could defer the immediate application of the rule, and continue reading the input until it had
seen

expr - expr - expr

It could then apply the rule to the rightmost three symbols, reducing them to expr and leaving

expr - expr

Now the rule can be reduced once more; the effect is to take the right associative interpretation.
Thus, having read 

expr - expr 

the parser can do two legal things, a shift or a reduction, and has no way of deciding between 
them. This is called a shift / reduce conflict. It may also happen that the parser has a choice of 
two legal reductions; this is called a reduce / reduce conflict. Note that there are never any
"shift/shift" conflicts.

When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser. It does 
this by selecting one of the valid steps wherever it has a choice. A rule describing which choice to 
make in a given situation is called a disambiguating rule. 

Yacc invokes two disambiguating rules by default: 

1. In a shift/reduce conflict, the default is to do the shift.

2. In a reduce/reduce conflict, the default is to reduce by the earlier grammar rule (in the input
sequence).

Rule 1 implies that reductions are deferred whenever there is a choice, in favor of shifts.
Rule 2 gives the user rather crude control over the behavior of the parser in this situation, but
reduce/reduce conflicts should be avoided whenever possible.
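
The practical difference between reducing early and shifting can be seen by evaluating the subtraction example both ways. The C sketch below is only an arithmetic illustration with invented function names; it does not involve a parser. eval_left models reducing as soon as "expr - expr" has been seen, while eval_right models the effect of Rule 1 on that grammar rule: everything is shifted first, and reductions then proceed from the right end.

```c
/* Left association: reduce as soon as "expr - expr" has been seen. */
int eval_left(const int *v, int n)
{
    int i, acc = v[0];

    for (i = 1; i < n; i++)
        acc = acc - v[i];
    return acc;
}

/* Right association: shift everything, then reduce from the right end. */
int eval_right(const int *v, int n)
{
    int i, acc = v[n - 1];

    for (i = n - 2; i >= 0; i--)
        acc = v[i] - acc;
    return acc;
}
```

For the operands 10, 4, 3 the first function computes (10 - 4) - 3 = 3 and the second computes 10 - (4 - 3) = 9, so the choice made in a shift/reduce conflict can change the meaning of the input, not just the shape of the parse.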

Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while 
consistent, require a more complex parser than Yacc can construct. The use of actions within rules 
can also cause conflicts, if the action must be done before the parser can be sure which rule is 
being recognized. In these cases, the application of disambiguating rules is inappropriate, and leads 
to an incorrect parser. For this reason, Yacc always reports the number of shift/reduce and 
reduce/reduce conflicts resolved by Rule 1 and Rule 2. 

In general, whenever it is possible to apply disambiguating rules to produce a correct parser, 
it is also possible to rewrite the grammar rules so that the same inputs are read but there are no
conflicts. For this reason, most previous parser generators have considered conflicts to be fatal 
errors. Our experience has suggested that this rewriting is somewhat unnatural, and produces 
slower parsers; thus, Yacc will produce parsers even in the presence of conflicts. 

As an example of the power of disambiguating rules, consider a fragment from a programming
language involving an "if-then-else" construction:

stat : IF '(' cond ')' stat
     | IF '(' cond ')' stat ELSE stat
     ;

In these rules, IF and ELSE are tokens, cond is a nonterminal symbol describing conditional
(logical) expressions, and stat is a nonterminal symbol describing statements. The first rule will be
called the simple-if rule, and the second the if-else rule.

These two rules form an ambiguous construction, since input of the form

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be structured according to these rules in two ways:

IF ( C1 ) {
        IF ( C2 ) S1
}
ELSE S2

or

IF ( C1 ) {
        IF ( C2 ) S1
        ELSE S2
}

The second interpretation is the one given in most programming languages having this construct.
Each ELSE is associated with the last preceding "un-ELSE'd" IF. In this example, consider the
situation where the parser has seen

IF ( C1 ) IF ( C2 ) S1

and is looking at the ELSE. It can immediately reduce by the simple-if rule to get

IF ( C1 ) stat

and then read the remaining input,

ELSE S2

and reduce

IF ( C1 ) stat ELSE S2

by the if-else rule. This leads to the first of the above groupings of the input.

On the other hand, the ELSE may be shifted, S2 read, and then the right hand portion of

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be reduced by the if-else rule to get

IF ( C1 ) stat

which can be reduced by the simple-if rule. This leads to the second of the above groupings of the
input, which is usually desired.

Once again the parser can do two valid things: there is a shift/reduce conflict. The
application of disambiguating rule 1 tells the parser to shift in this case, which leads to the desired
grouping.
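
The two groupings behave differently at run time, which is why the choice matters. The following C sketch writes each grouping out with explicit braces; the function names and the convention of returning 1 when S1 runs, 2 when S2 runs, and 0 when neither runs are assumptions of the example. C itself attaches an unbraced else to the nearest if, which corresponds to the second function.

```c
/* First grouping: the ELSE belongs to the outer IF. */
int outer_else(int c1, int c2)
{
    int ran = 0;

    if (c1) {
        if (c2)
            ran = 1;        /* S1 */
    } else {
        ran = 2;            /* S2 */
    }
    return ran;
}

/* Second grouping: the ELSE belongs to the inner IF (the shift choice). */
int inner_else(int c1, int c2)
{
    int ran = 0;

    if (c1) {
        if (c2)
            ran = 1;        /* S1 */
        else
            ran = 2;        /* S2 */
    }
    return ran;
}
```

The two functions agree only when both conditions are true; when C1 is false the first grouping runs S2 and the second runs nothing, and when C1 is true but C2 is false the roles are reversed.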

This shift/reduce conflict arises only when there is a particular current input symbol, ELSE,
and particular inputs already seen, such as

IF ( C1 ) IF ( C2 ) S1

In general, there may be many conflicts, and each one will be associated with an input symbol and 
a set of previously read inputs. The previously read inputs are characterized by the state of the 
parser. 

The conflict messages of Yacc are best understood by examining the verbose (-v) option
output file. For example, the output corresponding to the above conflict state might be:

23: shift/reduce conflict (shift 45, reduce 18) on ELSE

state 23

        stat : IF ( cond ) stat_ (18)
        stat : IF ( cond ) stat_ELSE stat

        ELSE shift 45
        . reduce 18

The first line describes the conflict, giving the state and the input symbol. The ordinary state
description follows, giving the grammar rules active in the state, and the parser actions. Recall that
the underline marks the portion of the grammar rules which has been seen. Thus in the example,
in state 23 the parser has seen input corresponding to

IF ( cond ) stat

and the two grammar rules shown are active at this time. The parser can do two possible things. If
the input symbol is ELSE, it is possible to shift into state 45. State 45 will have, as part of its
description, the line

stat : IF ( cond ) stat ELSE_stat

since the ELSE will have been shifted in this state. Back in state 23, the alternative action,
described by ".", is to be done if the input symbol is not mentioned explicitly in the above actions;
thus, in this case, if the input symbol is not ELSE, the parser reduces by grammar rule 18:

stat : IF '(' cond ')' stat 
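
The way the explicit entry and the default "." entry of state 23 combine can be sketched as a small C lookup. This is only a model of the listing above: the struct, the enum names, the function name state23, and the token number given to ELSE are invented for the illustration.

```c
enum { ELSE = 257 };                /* hypothetical token number */
enum kind { SHIFT, REDUCE };

struct action {
    enum kind kind;
    int arg;                        /* a state for SHIFT, a rule for REDUCE */
};

/* Action of state 23 for a given lookahead token. */
struct action state23(int token)
{
    struct action a;

    if (token == ELSE) {            /* ELSE shift 45 */
        a.kind = SHIFT;
        a.arg = 45;
    } else {                        /* . reduce 18 (the default) */
        a.kind = REDUCE;
        a.arg = 18;
    }
    return a;
}
```

Any token other than ELSE, for example ';', falls through to the default and causes a reduction by rule 18.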

Once again, notice that the numbers following "shift" commands refer to other states, while the
numbers following "reduce" commands refer to grammar rule numbers. In the y.output file, the
rule numbers are printed after those rules which can be reduced. In most states, there will be
at most one reduce action possible in the state, and this will be the default command. The user who
encounters unexpected shift/reduce conflicts will probably want to look at the verbose output to
decide whether the default actions are appropriate. In really tough cases, the user might need to
know more about the behavior and construction of the parser than can be covered here. In this
case, one of the theoretical references [2, 3, 4] might be consulted; the services of a local guru might
also be appropriate.

6: Precedence 

There is one common situation where the rules given above for resolving conflicts are not
sufficient; this is in the parsing of arithmetic expressions. Most of the commonly used
constructions for arithmetic expressions can be naturally described by the notion of precedence levels for
operators, together with information about left or right associativity. It turns out that ambiguous
grammars with appropriate disambiguating rules can be used to create parsers that are faster and
easier to write than parsers constructed from unambiguous grammars. The basic notion is to write
grammar rules of the form

        expr : expr OP expr

and

        expr : UNARY expr

for all binary and unary operators desired. This creates a very ambiguous grammar, with many
parsing conflicts. As disambiguating rules, the user specifies the precedence, or binding strength, of
all the operators, and the associativity of the binary operators. This information is sufficient to
allow Yacc to resolve the parsing conflicts in accordance with these rules, and construct a parser
that realizes the desired precedences and associativities.

The precedences and associativities are attached to tokens in the declarations section. This is 
done by a series of lines beginning with a Yacc keyword: %left, %right, or %nonassoc, followed by 
a list of tokens. All of the tokens on the same line are assumed to have the same precedence level 
and associativity; the lines are listed in order of increasing precedence or binding strength. Thus, 

%left '+' '-'
%left '*' '/' 

describes the precedence and associativity of the four arithmetic operators. Plus and minus are left 
associative, and have lower precedence than star and slash, which are also left associative. The key- 
word %right is used to describe right associative operators, and the keyword %nonassoc is used to 
describe operators, like the operator .LT. in Fortran, that may not associate with themselves; thus, 

        A .LT. B .LT. C

is illegal in Fortran, and such an operator would be described with the keyword %nonassoc in 
Yacc. As an example of the behavior of these declarations, the description 

        %right '='
        %left '+' '-'
        %left '*' '/'

        %%

        expr    :       expr '=' expr
                |       expr '+' expr
                |       expr '-' expr
                |       expr '*' expr
                |       expr '/' expr
                |       NAME
                ;

might be used to structure the input 

a = b = c*d - e - f*g 

as follows: 

a = (b = (((c*d)-e)-(f*g)))
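
The grouping above can be checked by hand with a small program. The following is a minimal sketch in C, and it is not Yacc's table-driven method: it uses the precedence climbing technique instead. It assumes single-letter names and the three precedence levels declared above, and every name in it (level, right_assoc, parse, parenthesize) is invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical precedence table mirroring
       %right '='
       %left  '+' '-'
       %left  '*' '/'
   A level of 0 means "not an operator". */
static int level(int op)
{
    switch (op) {
    case '=':           return 1;
    case '+': case '-': return 2;
    case '*': case '/': return 3;
    default:            return 0;
    }
}

static int right_assoc(int op) { return op == '='; }

static const char *in;          /* next unread input character */

static void skip_blanks(void) { while (*in == ' ') in++; }

/* Parse an expression whose operators all have level >= min (min >= 1),
   writing a fully parenthesized copy into out (at least 256 bytes). */
static void parse(int min, char *out)
{
    char lhs[256], rhs[256], tmp[256];

    skip_blanks();
    lhs[0] = *in ? *in++ : '?';         /* operand: a single-letter NAME */
    lhs[1] = '\0';

    for (;;) {
        int op, next_min;

        skip_blanks();
        op = *in;
        if (level(op) < min)            /* also stops at end of input */
            break;
        in++;
        /* left associative: the right operand may only contain tighter
           operators; right associative: it may contain the same level */
        next_min = right_assoc(op) ? level(op) : level(op) + 1;
        parse(next_min, rhs);
        snprintf(tmp, sizeof tmp, "(%s%c%s)", lhs, op, rhs);
        strcpy(lhs, tmp);
    }
    strcpy(out, lhs);
}

/* Entry point: out must have room for at least 256 bytes. */
void parenthesize(const char *text, char *out)
{
    assert(text != NULL && out != NULL);
    in = text;
    parse(1, out);
}
```

Calling parenthesize on the input a = b = c*d - e - f*g yields (a=(b=(((c*d)-e)-(f*g)))), which is the grouping given in the text with one extra outer pair of parentheses.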

When this mechanism is used, unary operators must, in general, be given a precedence. Sometimes
a unary operator and a binary operator have the same symbolic representation, but different
precedences. An example is unary and binary '-'; unary minus may be given the same strength as
multiplication, or even higher, while binary minus has a lower strength than multiplication. The
keyword %prec changes the precedence level associated with a particular grammar rule. %prec
appears immediately after the body of the grammar rule, before the action or closing semicolon,
and is followed by a token name or literal. It causes the precedence of the grammar rule to
become that of the following token name or literal. For example, to make unary minus have the
same precedence as multiplication the rules might resemble:



        %left '+' '-'
        %left '*' '/'

        %%

        expr    :       expr '+' expr
                |       expr '-' expr
                |       expr '*' expr
                |       expr '/' expr
                |       '-' expr    %prec '*'
                |       NAME
                ;



A token declared by %left, %right, and %nonassoc need not be, but may be, declared by 
%token as well. 

The precedences and associativities are used by Yacc to resolve parsing conflicts; they give
rise to disambiguating rules. Formally, the rules work as follows:

1. The precedences and associativities are recorded for those tokens and literals that have them.

2. A precedence and associativity is associated with each grammar rule; it is the precedence and
associativity of the last token or literal in the body of the rule. If the %prec construction is
used, it overrides this default. Some grammar rules may have no precedence and associativity
associated with them.

3. When there is a reduce/reduce conflict, or there is a shift/reduce conflict and either the input
symbol or the grammar rule has no precedence and associativity, then the two disambiguating
rules given at the beginning of the section are used, and the conflicts are reported.

4. If there is a shift/reduce conflict, and both the grammar rule and the input character have
precedence and associativity associated with them, then the conflict is resolved in favor of the
action (shift or reduce) associated with the higher precedence. If the precedences are the
same, then the associativity is used; left associative implies reduce, right associative implies
shift, and nonassociating implies error.
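
Rules 3 and 4 can be expressed as a small decision function. The sketch below is only an illustration under stated assumptions: the type and function names are hypothetical, a level of 0 stands for "no precedence", and larger levels bind more tightly. It does not show how Yacc itself stores this information.

```c
#include <assert.h>

enum assoc  { NONE, LEFT, RIGHT, NONASSOC };
enum action { SHIFT, REDUCE, ERROR, DEFAULT_RULES };

/* Hypothetical record: level 0 means no precedence was declared. */
struct prec {
    int        level;
    enum assoc assoc;
};

/* Decide a shift/reduce conflict between a grammar rule and an input token. */
enum action resolve(struct prec rule, struct prec token)
{
    assert(rule.level >= 0 && token.level >= 0);

    /* Rule 3: if either side lacks a precedence, the default
       disambiguating rules apply and the conflict is reported. */
    if (rule.level == 0 || token.level == 0)
        return DEFAULT_RULES;

    /* Rule 4: the action tied to the higher precedence wins. */
    if (token.level > rule.level)
        return SHIFT;
    if (token.level < rule.level)
        return REDUCE;

    /* Equal precedence: associativity decides.  Tokens on the same
       declaration line share one associativity, so either side will do. */
    switch (token.assoc) {
    case LEFT:  return REDUCE;
    case RIGHT: return SHIFT;
    default:    return ERROR;       /* nonassociating */
    }
}
```

For instance, with '+' at level 1 and '*' at level 2, both left associative, a rule ending in '+' facing the token '*' gives SHIFT, while a rule ending in '+' facing '+' gives REDUCE.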

Conflicts resolved by precedence are not counted in the number of shift/reduce and
reduce/reduce conflicts reported by Yacc. This means that mistakes in the specification of
precedences may disguise errors in the input grammar; it is a good idea to be sparing with
precedences, and use them in an essentially "cookbook" fashion, until some experience has been
gained. The y.output file is very useful in deciding whether the parser is actually doing what was
intended.

7: Error Handling 

Error handling is an extremely difficult area, and many of the problems are semantic ones. 
When an error is found, for example, it may be necessary to reclaim parse tree storage, delete or 
alter symbol table entries, and, typically, set switches to avoid generating any further output. 

It is seldom acceptable to stop all processing when an error is found; it is more useful to con- 
tinue scanning the input to find further syntax errors. This leads to the problem of getting the 
parser "restarted" after an error. A general class of algorithms to do this involves discarding a 
number of tokens from the input string, and attempting to adjust the parser so that input can con- 
tinue. 

To allow the user some control over this process, Yacc provides a simple, but reasonably
general, feature. The token name "error" is reserved for error handling. This name can be used in
grammar rules; in effect, it suggests places where errors are expected, and recovery might take
place. The parser pops its stack until it enters a state where the token "error" is legal. It then
behaves as if the token "error" were the current lookahead token, and performs the action
encountered. The lookahead token is then reset to the token that caused the error. If no special error
rules have been specified, the processing halts when an error is detected.

In order to prevent a cascade of error messages, the parser, after detecting an error, remains 
in error state until three tokens have been successfully read and shifted. If an error is detected 
when the parser is already in error state, no message is given, and the input token is quietly deleted. 
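
The three-token rule amounts to a small counter. The following sketch in C models only the behavior described in the paragraph above; the structure and function names are hypothetical, and the stack popping that accompanies real recovery is left out.

```c
#include <assert.h>

/* Hypothetical bookkeeping; the names are illustrative only. */
struct recovery {
    int errflag;    /* shifts still needed before leaving error state */
    int messages;   /* error messages actually issued */
    int discarded;  /* tokens quietly deleted while in error state */
};

/* A token was read and shifted successfully. */
void on_shift(struct recovery *r)
{
    assert(r != 0);
    if (r->errflag > 0)
        r->errflag--;
}

/* A syntax error was detected on the current token. */
void on_error(struct recovery *r)
{
    assert(r != 0);
    if (r->errflag == 0)
        r->messages++;      /* normal state: report the error */
    else
        r->discarded++;     /* error state: delete the token, no message */
    r->errflag = 3;         /* three clean shifts are required again */
}
```

Starting from a zeroed structure, an error followed by two shifts and a second error produces one message and one discarded token; only after three consecutive shifts does a later error produce a second message.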

As an example, a rule of the form 
stat : error 

would, in effect, mean that on a syntax error the parser would attempt to skip over the statement in 
which the error was seen. More precisely, the parser will scan ahead, looking for three tokens that
might legally follow a statement, and start processing at the first of these; if the beginnings of state- 
ments are not sufficiently distinctive, it may make a false start in the middle of a statement, and end 
up reporting a second error where there is in fact no error. 

Actions may be used with these special error rules. These actions might attempt to reinitialize
tables, reclaim symbol table space, etc. 

Error rules such as the above are very general, but difficult to control. Somewhat easier are 
rules such as 

stat : error ';' 

Here, when there is an error, the parser attempts to skip over the statement, but will do so by
skipping to the next ';'. All tokens after the error and before the next ';' cannot be shifted, and are
discarded. When the ';' is seen, this rule will be reduced, and any "cleanup" action associated with 
it performed. 

Another form of error rule arises in interactive applications, where it may be desirable to per- 
mit a line to be reentered after an error. A possible error rule might be 

        input   :       error '\n' { printf( "Reenter last line: " ); } input
                                { $$ = $4; }

There is one potential difficulty with this approach; the parser must correctly process three input 
tokens before it admits that it has correctly resynchronized after the error. If the reentered line
contains an error in the first two tokens, the parser deletes the offending tokens, and gives no mes- 
sage; this is clearly unacceptable. For this reason, there is a mechanism that can be used to force 
the parser to believe that an error has been fully recovered from. The statement 

        yyerrok ;

in an action resets the parser to its normal mode. The last example is better written 

        input   :       error '\n'
                                { yyerrok;
                                  printf( "Reenter last line: " ); }
                        input
                                { $$ = $4; }
                ;

As mentioned above, the token seen immediately after the "error" symbol is the input token 
at which the error was discovered. Sometimes, this is inappropriate; for example, an error recovery 
action might take upon itself the job of finding the correct place to resume input. In this case, the 
previous lookahead token must be cleared. The statement

yyclearin ; 

in an action will have this effect. For example, suppose the action after error were to call some 
sophisticated resynchronization routine, supplied by the user, that attempted to advance the input to
the beginning of the next valid statement. After this routine was called, the next token returned by 
yylex would presumably be the first token in a legal statement; the old, illegal token must be dis- 
carded, and the error state reset. This could be done by a rule like

        stat    :       error
                                { resynch();
                                  yyerrok ;
                                  yyclearin ; }
                ;

These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of 
the parser from many errors; moreover, the user can get control to deal with the error actions 
required by other portions of the program.

8: The Yacc Environment 

When the user inputs a specification to Yacc, the output is a file of C programs, called y.tab.c
on most systems (due to local file system conventions, the names may differ from installation to
installation). The function produced by Yacc is called yyparse; it is an integer valued function.
When it is called, it in turn repeatedly calls yylex, the lexical analyzer supplied by the user (see
Section 3) to obtain input tokens. Eventually, either an error is detected, in which case (if no error
recovery is possible) yyparse returns the value 1, or the lexical analyzer returns the endmarker
token and the parser accepts. In this case, yyparse returns the value 0.

The user must provide a certain amount of environment for this parser in order to obtain a 
working program. For example, as with every C program, a program called main must be defined, 
that eventually calls yyparse. In addition, a routine called yyerror prints a message when a syntax
error is detected. 

These two routines must be supplied in one form or another by the user. To ease the initial
effort of using Yacc, a library has been provided with default versions of main and yyerror. The
name of this library is system dependent; on many systems the library is accessed by a -ly
argument to the loader. To show the triviality of these default programs, the source is given below:

        main(){
                return( yyparse() );
                }

and

        # include <stdio.h>

        yyerror(s) char *s; {
                fprintf( stderr, "%s\n", s );
                }

The argument to yyerror is a string containing an error message, usually the string "syntax error". 
The average application will want to do better than this. Ordinarily, the program should keep track
of the input line number, and print it along with the message when a syntax error is detected. The
external integer variable yychar contains the lookahead token number at the time the error was
detected; this may be of some interest in giving better diagnostics. Since the main program is 
probably supplied by the user (to read arguments, etc.) the Yacc library is useful only in small pro- 
jects, or in the earliest stages of larger ones. 
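
A replacement yyerror along these lines might look as follows. This is a sketch, not the library's code: the counter lineno and the helper format_error are hypothetical names, and the variable lookahead merely stands in for yychar so that the fragment compiles on its own. In a real parser the lexical analyzer would maintain the line count and the generated parser would supply yychar.

```c
#include <assert.h>
#include <stdio.h>

static int lineno = 1;  /* hypothetical: the lexer would add 1 at each '\n' */
static int lookahead;   /* stands in for yychar in this self-contained sketch */

/* Build the diagnostic text in buf; returns the length snprintf reports. */
int format_error(char *buf, size_t size, const char *s, int line, int token)
{
    assert(buf != NULL && s != NULL);
    return snprintf(buf, size, "line %d: %s (lookahead token %d)",
                    line, s, token);
}

/* Called by the parser with a message, usually "syntax error". */
void yyerror(const char *s)
{
    char buf[128];

    format_error(buf, sizeof buf, s, lineno, lookahead);
    fprintf(stderr, "%s\n", buf);
}
```

Keeping the formatting separate from the printing is a design choice made here so that the message can be examined without capturing the error stream.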

The external integer variable yydebug is normally set to 0. If it is set to a nonzero value, the
parser will output a verbose description of its actions, including a discussion of which input
symbols have been read, and what the parser actions are. Depending on the operating environment, it
may be possible to set this variable by using a debugging system.

9: Hints for Preparing Specifications

This section contains miscellaneous hints on preparing efficient, easy to change, and clear
specifications. The individual subsections are more or less independent.

Input Style

It is difficult to provide rules with substantial actions and still have a readable specification 
file. The following style hints owe much to Brian Kernighan.

a. Use all capital letters for token names, all lower case letters for nonterminal names. This rule 
comes under the heading of "knowing who to blame when things go wrong."

b. Put grammar rules and actions on separate lines. This allows either to be changed without an 
automatic need to change the other. 

c. Put all rules with the same left hand side together. Put the left hand side in only once, and
let all following rules begin with a vertical bar.

d. Put a semicolon only after the last rule with a given left hand side, and put the semicolon on
a separate line. This allows new rules to be easily added.

e. Indent rule bodies by two tab stops, and action bodies by three tab stops.

The example in Appendix A is written following this style, as are the examples in the text of
this paper (where space permits). The user must make up his own mind about these stylistic ques- 
tions; the central problem, however, is to make the rules visible through the morass of action code. 

Left Recursion 

The algorithm used by the Yacc parser encourages so called "left recursive" grammar rules: 
rules of the form 

        name    :       name rest_of_rule ;

These rules frequently arise when writing specifications of sequences and lists: 

        list    :       item
                |       list ',' item
                ;

and

        seq     :       item
                |       seq item
                ;

In each of these cases, the first rule will be reduced for the first item only, and the second rule will
be reduced for the second and all succeeding items. 

With right recursive rules, such as 

        seq     :       item
                |       item seq
                ;

the parser would be a bit bigger, and the items would be seen, and reduced, from right to left. 
More seriously, an internal stack in the parser would be in danger of overflowing if a very long 
sequence were read. Thus, the user should use left recursion wherever reasonable. 
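
The difference in stack use can be made concrete by counting symbols. The sketch below is a simplified model written for this purpose, not part of Yacc: it tracks only the number of grammar symbols on the stack while n items are read, and the function names are hypothetical.

```c
#include <assert.h>

/* Maximum stack depth while reading n items with the left recursive
       seq : item | seq item ;                                        */
int max_depth_left(int n)
{
    int depth = 0, max = 0, i;

    assert(n >= 1);
    for (i = 0; i < n; i++) {
        depth++;                        /* shift item */
        if (depth > max)
            max = depth;
        if (i > 0)
            depth--;                    /* reduce seq : seq item (2 -> 1) */
        /* for i == 0 the reduction seq : item replaces 1 symbol by 1 */
    }
    return max;
}

/* The same count for the right recursive
       seq : item | item seq ;                                        */
int max_depth_right(int n)
{
    int depth = 0, max = 0, i;

    assert(n >= 1);
    for (i = 0; i < n; i++) {           /* nothing can be reduced yet */
        depth++;                        /* shift item */
        if (depth > max)
            max = depth;
    }
    /* Only now: reduce seq : item once (1 -> 1), then
       seq : item seq a further n-1 times (2 -> 1 each). */
    for (i = 1; i < n; i++)
        depth--;
    return max;
}
```

For 1000 items the left recursive form never holds more than 2 symbols, while the right recursive form holds all 1000 before the first reduction that shrinks the stack.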

It is worth considering whether a sequence with zero elements has any meaning, and if so, 
consider writing the sequence specification with an empty rule:

        seq     :       /* empty */
                |       seq item
                ;

Once again, the first rule would always be reduced exactly once, before the first item was read, and
then the second rule would be reduced once for each item read. Permitting empty sequences often
leads to increased generality. However, conflicts might arise if Yacc is asked to decide which
empty sequence it has seen, when it hasn't seen enough to know!

Lexical Tie-ins

Some lexical decisions depend on context. For example, the lexical analyzer might want to 
delete blanks normally, but not within quoted strings. Or names might be entered into a symbol 
table in declarations, but not in expressions. 

One way of handling this situation is to create a global flag that is examined by the lexical 
analyzer, and set by actions. For example, suppose a program consists of 0 or more declarations, 
followed by 0 or more statements. Consider: 

        %{
                int dflag;
        %}
          ... other declarations ...

        %%

        prog    :       decls stats
                ;

        decls   :       /* empty */
                                { dflag = 1; }
                |       decls declaration
                ;

        stats   :       /* empty */
                                { dflag = 0; }
                |       stats statement
                ;

            ... other rules ...

The flag dflag is now 0 when reading statements, and 1 when reading declarations, except for the
first token in the first statement. This token must be seen by the parser before it can tell that the
declaration section has ended and the statements have begun. In many cases, this single token
exception does not affect the lexical scan.

This kind of "backdoor" approach can be elaborated to a noxious degree. Nevertheless, it 
represents a way of doing some things that are difficult, if not impossible, to do otherwise.
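
On the lexical analyzer's side, the flag might be consulted as in the following sketch. Only dflag comes from the text; the symbol table, its size limits, and the functions lookup and name_token are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAXSYM 64

int dflag;                          /* set and cleared by grammar actions */

static char symtab[MAXSYM][16];     /* hypothetical symbol table */
static int  nsym;

/* Return the table index of name, or -1 if it is absent. */
int lookup(const char *name)
{
    int i;

    for (i = 0; i < nsym; i++)
        if (strcmp(symtab[i], name) == 0)
            return i;
    return -1;
}

/* Called by the lexical analyzer for each name it scans.  While
   declarations are being read (dflag != 0) an unknown name is entered;
   in statements it is only looked up. */
int name_token(const char *name)
{
    int i;

    assert(strlen(name) < sizeof symtab[0]);
    i = lookup(name);
    if (i < 0 && dflag && nsym < MAXSYM) {
        strcpy(symtab[nsym], name);
        i = nsym++;
    }
    return i;
}
```

With dflag set to 1 the first call for a new name returns its fresh index; with dflag set to 0 an unknown name returns -1, which the caller could treat as an undeclared variable.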

Reserved Words 

Some programming languages permit the user to use words like "if", which are normally
reserved, as label or variable names, provided that such use does not conflict with the legal use of
these names in the programming language. This is extremely hard to do in the framework of Yacc;
it is difficult to pass information to the lexical analyzer telling it "this instance of 'if' is a keyword,
and that instance is a variable". The user can make a stab at it, using the mechanism described in
the last subsection, but it is difficult.

A number of ways of making this easier are under advisement. Until then, it is better that the
keywords be reserved; that is, be forbidden for use as variable names. There are powerful stylistic
reasons for preferring this, anyway.

10: Advanced Topics 

This section discusses a number of advanced features of Yacc.

Simulating Error and Accept in Actions

The parsing actions of error and accept can be simulated in an action by use of macros 
YYACCEPT and YYERROR. YYACCEPT causes yyparse to return the value 0; YYERROR
causes the parser to behave as if the current input symbol had been a syntax error; yyerror is 
called, and error recovery takes place. These mechanisms can be used to simulate parsers with 
multiple endmarkers or context-sensitive syntax checking. 

Accessing Values in Enclosing Rules. 

An action may refer to values returned by actions to the left of the current rule. The 
mechanism is simply the same as with ordinary actions, a dollar sign followed by a digit, but in this 
case the digit may be 0 or negative. Consider 

        sent    :       adj noun verb adj noun
                                { look at the sentence ... }
                ;

        adj     :       THE     { $$ = THE; }
                |       YOUNG   { $$ = YOUNG; }
                ;

        noun    :       DOG
                                { $$ = DOG; }
                |       CRONE
                                { if( $0 == YOUNG ){
                                        printf( "what?\n" );
                                        }
                                  $$ = CRONE;
                                  }
                ;



In the action following the word CRONE, a check is made that the preceding token shifted was not 
YOUNG. Obviously, this is only possible when a great deal is known about what might precede 
the symbol noun in the input. There is also a distinctly unstructured flavor about this. Neverthe- 
less, at times this mechanism will save a great deal of trouble, especially when a few combinations 
are to be excluded from an otherwise regular structure.

Support for Arbitrary Value Types 

By default, the values returned by actions and the lexical analyzer are integers. Yacc can also
support values of other types, including structures. In addition, Yacc keeps track of the types, and
inserts appropriate union member names so that the resulting parser will be strictly type checked.
The Yacc value stack (see Section 4) is declared to be a union of the various types of values
desired. The user declares the union, and associates union member names to each token and
nonterminal symbol having a value. When the value is referenced through a $$ or $n construction,
Yacc will automatically insert the appropriate union name, so that no unwanted conversions will
take place. In addition, type checking commands such as Lint will be far more silent.

There are three mechanisms used to provide for this typing. First, there is a way of defining
the union; this must be done by the user since other programs, notably the lexical analyzer, must 
know about the union member names. Second, there is a way of associating a union member name 
with tokens and nonterminals. Finally, there is a mechanism for describing the type of those few 
values where Yacc can not easily determine the type. 

To declare the union, the user includes in the declaration section:

        %union {
                body of union ...
                }

This declares the Yacc value stack, and the external variables yylval and yyval, to have type equal 
to this union. If Yacc was invoked with the -d option, the union declaration is copied onto the 
y.tab.h file. Alternatively, the union may be declared in a header file, and a typedef used to define 
the variable YYSTYPE to represent this union. Thus, the header file might also have said: 

typedef union { 

body of union ... 
} YYSTYPE; 

The header file must be included in the declarations section, by use of %{ and %}. 

Once YYSTYPE is defined, the union member names must be associated with the various
terminal and nonterminal names. The construction

< name > 

is used to indicate a union member name. If this follows one of the keywords %token, %left,
%right, and %nonassoc, the union member name is associated with the tokens listed. Thus, saying 

        %left <optype> '+' '-'

will cause any reference to values returned by these two tokens to be tagged with the union 
member name optype. Another keyword, %type, is used similarly to associate union member
names with nonterminals. Thus, one might say 

        %type <nodetype> expr stat

There remain a couple of cases where these mechanisms are insufficient. If there is an action 
within a rule, the value returned by this action has no a priori type. Similarly, reference to left 
context values (such as $0 - see the previous subsection) leaves Yacc with no easy way of
knowing the type. In this case, a type can be imposed on the reference by inserting a union member
name, between < and >, immediately after the first $. An example of this usage is 

        rule    :       aaa  { $<intval>$ = 3; } bbb
                                { fun( $<intval>2, $<other>0 ); }
                ;

This syntax has little to recommend it, but the situation arises rarely. 

A sample specification is given in Appendix C. The facilities in this subsection are not
triggered until they are used: in particular, the use of %type will turn on these mechanisms. When
they are used, there is a fairly strict level of checking. For example, use of $n or $$ to refer to
something with no defined type is diagnosed. If these facilities are not triggered, the Yacc value
stack is used to hold int's, as was true historically.

Appendix A: A Simple Example

This example gives the complete Yacc specification for a small desk calculator; the desk cal- 
culator has 26 registers, labeled "a" through "z", and accepts arithmetic expressions made up of the 
operators +, -, *, /, % (mod operator), & (bitwise and), | (bitwise or), and assignment. If an
expression at the top level is an assignment, the value is not printed; otherwise it is. As in C, an 
integer that begins with 0 (zero) is assumed to be octal; otherwise, it is assumed to be decimal. 

As an example of a Yacc specification, the desk calculator does a reasonable job of showing 
how precedences and ambiguities are used, and demonstrating simple error recovery. The major 
oversimplifications are that the lexical analysis phase is much simpler than for most applications, 
and the output is produced immediately, line by line. Note the way that decimal and octal integers 
are read in by the grammar rules. This job is probably better done by the lexical analyzer.
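
To indicate what doing this job in the lexical analyzer might involve, here is a minimal sketch in C. The function scan_number is a hypothetical helper, not part of the example that follows; like the grammar rules, it treats a leading 0 as selecting octal and does not reject the digits 8 and 9 in an octal number.

```c
#include <assert.h>
#include <ctype.h>

/* Convert the run of digits at *p and advance *p past them.
   As in C, a leading 0 selects octal; otherwise decimal. */
int scan_number(const char **p)
{
    int base, value = 0;

    assert(p != 0 && isdigit((unsigned char)**p));
    base = (**p == '0') ? 8 : 10;
    while (isdigit((unsigned char)**p)) {
        value = base * value + (**p - '0');
        (*p)++;
    }
    return value;
}
```

A lexical analyzer using such a routine would return a single NUMBER token with the converted value in yylval, and the two rules for number could then be dropped from the grammar.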



%{
# include <stdio.h>
# include <ctype.h>

int regs[26];
int base;

%}

%start list

%token DIGIT LETTER

%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS      /* supplies precedence for unary minus */

%%      /* beginning of rules section */

list    :       /* empty */
        |       list stat '\n'
        |       list error '\n'
                        { yyerrok; }
        ;

stat    :       expr
                        { printf( "%d\n", $1 ); }
        |       LETTER '=' expr
                        { regs[$1] = $3; }
        ;

expr    :       '(' expr ')'
                        { $$ = $2; }
        |       expr '+' expr
                        { $$ = $1 + $3; }
        |       expr '-' expr
                        { $$ = $1 - $3; }
        |       expr '*' expr
                        { $$ = $1 * $3; }
        |       expr '/' expr
                        { $$ = $1 / $3; }
        |       expr '%' expr
                        { $$ = $1 % $3; }
        |       expr '&' expr
                        { $$ = $1 & $3; }
        |       expr '|' expr
                        { $$ = $1 | $3; }
        |       '-' expr        %prec UMINUS
                        { $$ = - $2; }
        |       LETTER
                        { $$ = regs[$1]; }
        |       number
        ;

number  :       DIGIT
                        { $$ = $1; base = ($1==0) ? 8 : 10; }
        |       number DIGIT
                        { $$ = base * $1 + $2; }
        ;

%%      /* start of programs */

yylex() {       /* lexical analysis routine */
        /* returns LETTER for a lower case letter, yylval = 0 through 25 */
        /* return DIGIT for a digit, yylval = 0 through 9 */
        /* all other characters are returned immediately */

        int c;

        while( (c = getchar()) == ' ' ) { /* skip blanks */ }

        /* c is now nonblank */

        if( islower( c ) ) {
                yylval = c - 'a';
                return ( LETTER );
                }
        if( isdigit( c ) ) {
                yylval = c - '0';
                return( DIGIT );
                }
        return( c );
        }

Appendix B: Yacc Input Syntax

This Appendix has a description of the Yacc input syntax, as a Yacc specification. Context 
dependencies, etc., are not considered. Ironically, the Yacc input specification language is most 
naturally specified as an LR(2) grammar; the sticky part comes when an identifier is seen in a rule, 
immediately following an action. If this identifier is followed by a colon, it is the start of the next 
rule; otherwise it is a continuation of the current rule, which just happens to have an action
embedded in it. As implemented, the lexical analyzer looks ahead after seeing an identifier, and decides
whether the next token (skipping blanks, newlines, comments, etc.) is a colon. If so, it returns the
token C_IDENTIFIER. Otherwise, it returns IDENTIFIER. Literals (quoted strings) are also
returned as IDENTIFIERs, but never as part of C_IDENTIFIERs.

        /* grammar for the input to Yacc */

        /* basic entities */
%token  IDENTIFIER      /* includes identifiers and literals */
%token  C_IDENTIFIER    /* identifier (but not literal) followed by colon */
%token  NUMBER          /* [0-9]+ */

        /* reserved words: %type => TYPE, %left => LEFT, etc. */

%token  LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION

%token  MARK    /* the %% mark */
%token  LCURL   /* the %{ mark */
%token  RCURL   /* the %} mark */

        /* ascii character literals stand for themselves */

%start  spec

%%

spec    :       defs MARK rules tail
        ;

tail    :       MARK    { In this action, eat up the rest of the file }
        |       /* empty: the second MARK is optional */
        ;

defs    :       /* empty */
        |       defs def
        ;

def     :       START IDENTIFIER
        |       UNION   { Copy union definition to output }
        |       LCURL   { Copy C code to output file } RCURL
        |       ndefs rword tag nlist
        ;

rword   :       TOKEN
        |       LEFT
        |       RIGHT
        |       NONASSOC
        |       TYPE
        ;

tag     :       /* empty: union tag is optional */
        |       '<' IDENTIFIER '>'
        ;

nlist   :       nmno
        |       nlist nmno
        |       nlist ',' nmno
        ;

nmno    :       IDENTIFIER              /* NOTE: literal illegal with %type */
        |       IDENTIFIER NUMBER       /* NOTE: illegal with %type */
        ;

        /* rules section */

rules   :       C_IDENTIFIER rbody prec
        |       rules rule
        ;

rule    :       C_IDENTIFIER rbody prec
        |       '|' rbody prec
        ;

rbody   :       /* empty */
        |       rbody IDENTIFIER
        |       rbody act
        ;

act     :       '{' { Copy action, translate $$, etc. } '}'
        ;

prec    :       /* empty */
        |       PREC IDENTIFIER
        |       PREC IDENTIFIER act
        |       prec ';'
        ;

Appendix C: An Advanced Example

This Appendix gives an example of a grammar using some of the advanced features discussed 
in Section 10. The desk calculator example in Appendix A is modified to provide a desk calculator 
that does floating point interval arithmetic. The calculator understands floating point constants, the 
arithmetic operations +, -, *, /, unary -, and = (assignment), and has 26 floating point
variables, "a" through "z". Moreover, it also understands intervals, written

(x,y) 

where x is less than or equal to y. There are 26 interval valued variables "A" through "Z" that
may also be used. The usage is similar to that in Appendix A; assignments return no value, and 
print nothing, while expressions print the (floating or interval) value. 

This example explores a number of interesting features of Yacc and C. Intervals are 
represented by a structure, consisting of the left and right endpoint values, stored as double's. This
structure is given a type name, INTERVAL, by using typedef. The Yacc value stack can also con- 
tain floating point scalars, and integers (used to index into the arrays holding the variable values). 
Notice that this entire strategy depends strongly on being able to assign structures and unions in C. 
In fact, many of the actions call functions that return structures as well. 

It is also worth noting the use of YYERROR to handle error conditions: division by an
interval containing 0, and an interval presented in the wrong order. In effect, the error recovery
mechanism of Yacc is used to throw away the rest of the offending line. 

In addition to the mixing of types on the value stack, this grammar also demonstrates an
interesting use of syntax to keep track of the type (e.g. scalar or interval) of intermediate expres- 
sions. Note that a scalar can be automatically promoted to an interval if the context demands an 
interval value. This causes a large number of conflicts when the grammar is run through Yacc: 18 
Shift/Reduce and 26 Reduce/Reduce. The problem can be seen by looking at the two input lines: 

2.5 + ( 3.5 - 4. ) 

and 

2.5 + ( 3.5 , 4. )

Notice that the 2.5 is to be used in an interval valued expression in the second example, but this 
fact is not known until the "," is read; by this time, 2.5 is finished, and the parser cannot go back
and change its mind. More generally, it might be necessary to look ahead an arbitrary number of 
tokens to decide whether to convert a scalar to an interval. This problem is evaded by having two 
rules for each binary interval valued operator: one when the left operand is a scalar, and one when 
the left operand is an interval. In the second case, the right operand must be an interval, so the
conversion will be applied automatically. Despite this evasion, there are still many cases where the
conversion may be applied or not, leading to the above conflicts. They are resolved by listing the
rules that yield scalars first in the specification file; in this way, the conflicts will be resolved in the
direction of keeping scalar valued expressions scalar valued until they are forced to become
intervals.

This way of handling multiple types is very instructive, but not very general. If there were 
many kinds of expression types, instead of just two, the number of rules needed would increase 
dramatically, and the conflicts even more dramatically. Thus, while this example is instructive, it is 
better practice in a more normal programming language environment to keep the type information 
as part of the value, and not as part of the grammar. 
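
As a rough illustration of the alternative recommended above, the type can travel with the value in a tagged structure, so that promotion happens in the action code rather than in the grammar. This is only a sketch under stated assumptions: it is not part of the desk calculator, and the names KIND, VALUE, mkscalar, mkinterval, and vadd are hypothetical.

```c
/* Sketch: carry the type with the value instead of encoding it in the grammar. */
typedef struct interval {
    double lo, hi;
} INTERVAL;

typedef enum { SCALAR, IVAL } KIND;     /* hypothetical tag names */

typedef struct value {
    KIND kind;
    INTERVAL v;                         /* a scalar x is held as lo == hi == x */
} VALUE;

/* Build a scalar value. */
VALUE mkscalar(double x)
{
    VALUE r;
    r.kind = SCALAR;
    r.v.lo = r.v.hi = x;
    return r;
}

/* Build an interval value. */
VALUE mkinterval(double lo, double hi)
{
    VALUE r;
    r.kind = IVAL;
    r.v.lo = lo;
    r.v.hi = hi;
    return r;
}

/* One addition routine serves both kinds; the result is an interval
   if either operand is one. */
VALUE vadd(VALUE a, VALUE b)
{
    VALUE r;
    r.kind = (a.kind == IVAL || b.kind == IVAL) ? IVAL : SCALAR;
    r.v.lo = a.v.lo + b.v.lo;
    r.v.hi = a.v.hi + b.v.hi;
    return r;
}
```

With a representation of this kind a single rule per operator would suffice, at the cost of a run-time check of the tag wherever the distinction matters.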

Finally, a word about the lexical analysis. The only unusual feature is the treatment of float-
ing point constants. The C library routine atof is used to do the actual conversion from a character
string to a double precision value. If the lexical analyzer detects an error, it responds by returning 
a token that is illegal in the grammar, provoking a syntax error in the parser, and thence error 
recovery. 

%{

# include <stdio.h>
# include <ctype.h>

typedef struct interval {
    double lo, hi;
    } INTERVAL;

INTERVAL vmul(), vdiv();

double atof();

double dreg[ 26 ];
INTERVAL vreg[ 26 ];

%}

%start lines

%union {
    int ival;
    double dval;
    INTERVAL vval;
    }

%token <ival> DREG VREG /* indices into dreg, vreg arrays */
%token <dval> CONST /* floating point constant */

%type <dval> dexp /* expression */
%type <vval> vexp /* interval expression */

/* precedence information about the operators */
%left '+' '-'
%left '*' '/'
%left UMINUS /* precedence for unary minus */

%%

lines   : /* empty */
        | lines line
        ;

line    : dexp '\n'
            { printf( "%15.8f\n", $1 ); }
        | vexp '\n'
            { printf( "(%15.8f , %15.8f )\n", $1.lo, $1.hi ); }
        | DREG '=' dexp '\n'
            { dreg[$1] = $3; }
        | VREG '=' vexp '\n'
            { vreg[$1] = $3; }
        | error '\n'
            { yyerrok; }
        ;


dexp    : CONST
        | DREG
            { $$ = dreg[$1]; }
        | dexp '+' dexp
            { $$ = $1 + $3; }
        | dexp '-' dexp
            { $$ = $1 - $3; }
        | dexp '*' dexp
            { $$ = $1 * $3; }
        | dexp '/' dexp
            { $$ = $1 / $3; }
        | '-' dexp %prec UMINUS
            { $$ = - $2; }
        | '(' dexp ')'
            { $$ = $2; }
        ;


vexp    : dexp
            { $$.hi = $$.lo = $1; }
        | '(' dexp ',' dexp ')'
            {
            $$.lo = $2;
            $$.hi = $4;
            if( $$.lo > $$.hi ){
                printf( "interval out of order\n" );
                YYERROR;
                }
            }
        | VREG
            { $$ = vreg[$1]; }
        | vexp '+' vexp
            { $$.hi = $1.hi + $3.hi;
              $$.lo = $1.lo + $3.lo; }
        | dexp '+' vexp
            { $$.hi = $1 + $3.hi;
              $$.lo = $1 + $3.lo; }
        | vexp '-' vexp
            { $$.hi = $1.hi - $3.lo;
              $$.lo = $1.lo - $3.hi; }
        | dexp '-' vexp
            { $$.hi = $1 - $3.lo;
              $$.lo = $1 - $3.hi; }
        | vexp '*' vexp
            { $$ = vmul( $1.lo, $1.hi, $3 ); }
        | dexp '*' vexp
            { $$ = vmul( $1, $1, $3 ); }
        | vexp '/' vexp
            { if( dcheck( $3 ) ) YYERROR;
              $$ = vdiv( $1.lo, $1.hi, $3 ); }
        | dexp '/' vexp
            { if( dcheck( $3 ) ) YYERROR;
              $$ = vdiv( $1, $1, $3 ); }
        | '-' vexp %prec UMINUS
            { $$.hi = -$2.lo; $$.lo = -$2.hi; }
        | '(' vexp ')'
            { $$ = $2; }
        ;

%%

# define BSZ 50 /* buffer size for floating point numbers */

/* lexical analysis */

yylex(){
    register c;

    while( (c=getchar()) == ' ' ){ /* skip over blanks */ }

    if( isupper( c ) ){
        yylval.ival = c - 'A';
        return( VREG );
        }
    if( islower( c ) ){
        yylval.ival = c - 'a';
        return( DREG );
        }

    if( isdigit( c ) || c=='.' ){
        /* gobble up digits, points, exponents */

        char buf[BSZ+1], *cp = buf;
        int dot = 0, exp = 0;

        for( ; (cp-buf)<BSZ ; ++cp,c=getchar() ){

            *cp = c;
            if( isdigit( c ) ) continue;
            if( c == '.' ){
                if( dot++ || exp ) return( '.' ); /* will cause syntax error */
                continue;
                }

            if( c == 'e' ){
                if( exp++ ) return( 'e' ); /* will cause syntax error */
                continue;
                }

            /* end of number */
            break;
            }
        *cp = '\0';
        if( (cp-buf) >= BSZ ) printf( "constant too long: truncated\n" );
        else ungetc( c, stdin ); /* push back last char read */
        yylval.dval = atof( buf );
        return( CONST );
        }
    return( c );
    }

INTERVAL hilo( a, b, c, d ) double a, b, c, d; { 

/* returns the smallest interval containing a, b, c, and d */ 
/* used by *, / routines */ 
INTERVAL v; 

if( a>b ) { v.hi = a; v.lo = b; } 
else { v.hi = b; v.lo = a; } 

if( c>d ) { 

if( c>v.hi ) v.hi = c; 
if( d<v.lo ) v.lo = d; 
} 

else { 

if( d>v.hi ) v.hi = d; 
if( c<v.lo ) v.lo = c; 
} 

return( v ); 
} 

INTERVAL vmul( a, b, v ) double a, b; INTERVAL v; {
return( hilo( a*v.hi, a*v.lo, b*v.hi, b*v.lo ) ); 
} 

dcheck( v ) INTERVAL v; {

    if( v.hi >= 0. && v.lo <= 0. ){
        printf( "divisor interval contains 0.\n" );
        return( 1 );
        }
    return( 0 );
}

INTERVAL vdiv( a, b, v ) double a, b; INTERVAL v; {
    return( hilo( a/v.hi, a/v.lo, b/v.hi, b/v.lo ) );
}
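
To see what the interval routines compute, the same logic can be restated with ANSI C prototypes and tried on a few values. This is a sketch for experimentation only; the use of prototypes is an assumption made here, since the listing above uses the older declaration style.

```c
typedef struct interval {
    double lo, hi;
} INTERVAL;

/* Smallest interval containing a, b, c, and d (same logic as hilo above). */
INTERVAL hilo(double a, double b, double c, double d)
{
    INTERVAL v;

    if (a > b) { v.hi = a; v.lo = b; }
    else       { v.hi = b; v.lo = a; }

    if (c > d) {
        if (c > v.hi) v.hi = c;
        if (d < v.lo) v.lo = d;
    } else {
        if (d > v.hi) v.hi = d;
        if (c < v.lo) v.lo = c;
    }
    return v;
}

/* (a,b) * v: the extremes must occur among the four endpoint products. */
INTERVAL vmul(double a, double b, INTERVAL v)
{
    return hilo(a * v.hi, a * v.lo, b * v.hi, b * v.lo);
}

/* (a,b) / v: meaningful only when v does not contain 0, which is
   what dcheck tests before vdiv is called in the grammar. */
INTERVAL vdiv(double a, double b, INTERVAL v)
{
    return hilo(a / v.hi, a / v.lo, b / v.hi, b / v.lo);
}
```

For instance, multiplying the interval (-1, 2) by (3, 4) gives the four products -4, -3, 8, and 6, so the result is (-4, 8).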

Appendix D: Old Features Supported but not Encouraged

This Appendix mentions synonyms and features which are supported for historical continuity, 
but, for various reasons, are not encouraged. 

1. Literals may also be delimited by double quotes (").

2. Literals may be more than one character long. If all the characters are alphabetic, numeric, 
or _, the type number of the literal is defined, just as if the literal did not have the quotes 
around it. Otherwise, it is difficult to find the value for such literals. 

The use of multi-character literals is likely to mislead those unfamiliar with Yacc, since it sug- 
gests that Yacc is doing a job which must be actually done by the lexical analyzer. 

3. Most places where % is legal, backslash "\" may be used. In particular, \\ is the same as 
%%, \left the same as %left, etc.

4. There are a number of other synonyms:

%< is the same as %left 

%> is the same as %right 

%binary and %2 are the same as %nonassoc 

%0 and %term are the same as %token

%= is the same as %prec 

5. Actions may also have the form 

= {...} 

and the curly braces can be dropped if the action is a single C statement. 

6. C code between %{ and %} used to be permitted at the head of the rules section, as well as in 
the declaration section. 

Lex helps write programs whose control flow is directed by instances of regular expressions in
the input stream. It is well suited for editor-script type transformations and for segmenting input in
preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is
translated to a program which reads an input stream, copying it to an output stream and partitioning
the input into strings which match the given expressions. As each such string is recognized the
corresponding program fragment is executed. The recognition of the expressions is performed by a
deterministic finite automaton generated by Lex. The program fragments written by the user are exe-
cuted in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the
longest match possible at each input point. If necessary, substantial lookahead is performed on the
input, but the input stream will be backed up to the end of the current partition, so that the user has
general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automati-
cally to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS sys-
tems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is
the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with 
Yacc, for those with access to this compiler-compiler system. 

1. Introduction.

Lex is a program generator designed for lex- 
ical processing of character input streams. It 
accepts a high-level, problem oriented specification 
for character string matching, and produces a pro- 
gram in a general purpose language which recog- 
nizes regular expressions. The regular expressions 
are specified by the user in the source 
specifications given to Lex. The Lex written code 
recognizes these expressions in an input stream 
and partitions the input stream into strings match- 
ing the expressions. At the boundaries between 
strings program sections provided by the user are
executed. The Lex source file associates the regu-
lar expressions and the program fragments. As
each expression appears in the input to the pro- 
gram written by Lex, the corresponding fragment 
is executed. 

The user supplies the additional code 
beyond expression matching needed to complete 
his tasks, possibly including code written by other 
generators. The program that recognizes the 
expressions is generated in the general purpose 
programming language employed for the user's 
program fragments. Thus, a high level expression 
language is provided to write the string expressions 
to be matched while the user's freedom to write
actions is unimpaired. This avoids forcing the user 
who wishes to use a string manipulation language 
for input analysis to write processing programs in 
the same and often inappropriate string handling 
language. 

Lex is not a complete language, but rather a 
generator representing a new language feature 
which can be added to different programming 
languages, called "host languages." Just as general 
purpose languages can produce code to run on 
different computer hardware, Lex can write code 
in different host languages. The host language is 
used for the output code generated by Lex and 
also for the program fragments added by the user. 
Compatible run-time libraries for the different host 
languages are also provided. This makes Lex 
adaptable to different environments and different 
users. Each application may be directed to the 
combination of hardware and host language 
appropriate to the task, the user's background, and 
the properties of local implementations. At 
present, the only supported host language is C, 
although Fortran (in the form of Ratfor [2]) has
been available in the past. Lex itself exists on
UNIX, GCOS, and OS/370; but the code gen-
erated by Lex may be taken anywhere the
appropriate compilers exist.

Lex turns the user's expressions and actions
(called source in this memo) into the host general-
purpose language; the generated program is named
yylex. The yylex program will recognize expres-
sions in a stream (called input in this memo) and
perform the specified actions for each expression
as it is detected. See Figure 1.

        Source -> | Lex   | -> yylex

        Input  -> | yylex | -> Output

              An overview of Lex
                   Figure 1

For a trivial example, consider a program to 
delete from the input all blanks or tabs at the ends 
of lines. 

        %%
        [ \t]+$   ;

is all that is required. The program contains a %%
delimiter to mark the beginning of the rules, and
one rule. This rule contains a regular expression
which matches one or more instances of the char-
acters blank or tab (written \t for visibility, in
accordance with the C language convention) just
prior to the end of a line. The brackets indicate
the character class made of blank and tab; the +
indicates "one or more ..."; and the $ indicates
"end of line," as in QED. No action is specified,
so the program generated by Lex (yylex) will
ignore these characters. Everything else will be
copied. To change any remaining string of blanks
or tabs to a single blank, add another rule:

        %%
        [ \t]+$   ;
        [ \t]+    printf(" ");

The finite automaton generated for this source will 
scan for both rules at once, observing at the termi- 
nation of the string of blanks or tabs whether or 
not there is a newline character, and executing the 
desired rule action. The first rule matches all 
strings of blanks or tabs at the end of lines, and 
the second rule all remaining strings of blanks or 
tabs. 
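
For comparison, the effect of these two rules can be written out by hand. The following is a minimal sketch of an equivalent C routine working on a string in memory; the name squeeze_blanks is hypothetical, and the code Lex actually generates is table driven rather than written this way.

```c
/* Hand-written equivalent of the two rules, for a NUL-terminated string.
   out must have room for at least strlen(in) + 1 bytes. */
void squeeze_blanks(const char *in, char *out)
{
    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* scan to the end of the run of blanks and tabs */
            while (*in == ' ' || *in == '\t')
                in++;
            /* first rule: a run just before a newline is dropped;
               second rule: any other run becomes a single blank */
            if (*in != '\n')
                *out++ = ' ';
        } else {
            *out++ = *in++;     /* default action: copy the character */
        }
    }
    *out = '\0';
}
```

As with the automaton, the decision between the two rules is made only at the end of the run of blanks and tabs, by looking at the character that follows it.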

Lex can be used alone for simple transfor- 
mations, or for analysis and statistics gathering on 
a lexical level. Lex can also be used with a parser 
generator to perform the lexical analysis phase; it 
is particularly easy to interface Lex and Yacc [3]. 
Lex programs recognize only regular expressions; 
Yacc writes parsers that accept a large class of con- 
text free grammars, but require a lower level 
analyzer to recognize input tokens. Thus, a combi- 
nation of Lex and Yacc is often appropriate. 
When used as a preprocessor for a later parser 
generator, Lex is used to partition the input 
stream, and the parser generator assigns structure 
to the resulting pieces. The flow of control in such 
a case (which might be the first half of a compiler, 
for example) is shown in Figure 2. Additional 
programs, written by other generators or by hand,
can be added easily to programs written by Lex.

                lexical          grammar
                 rules            rules
                   |                |
                   v                v
                +-----+         +------+
                | Lex |         | Yacc |
                +-----+         +------+
                   |                |
                   v                v
               +-------+       +---------+
     Input ->  | yylex |  ->   | yyparse |  ->  Parsed input
               +-------+       +---------+

                        Lex with Yacc
                          Figure 2

Yacc users will realize that the name yylex is what 
Yacc expects its lexical analyzer to be named, so 
that the use of this name by Lex simplifies inter- 
facing. 

Lex generates a deterministic finite automa- 
ton from the regular expressions in the source [4]. 
The automaton is interpreted, rather than com- 
piled, in order to save space. The result is still a 
fast analyzer. In particular, the time taken by a
Lex program to recognize and partition an input
stream is proportional to the length of the input.
The number of Lex rules or the complexity of the 
rules is not important in determining speed, unless 
rules which include forward context require a 
significant amount of rescanning. What does 
increase with the number and complexity of rules 
is the size of the finite automaton, and therefore 
the size of the program generated by Lex. 

In the program written by Lex, the user's
fragments (representing the actions to be per-
formed as each regular expression is found) are
gathered as cases of a switch. The automaton
interpreter directs the control flow. Opportunity is
provided for the user to insert either declarations
or additional statements in the routine containing
the actions, or to add subroutines outside this
action routine.

Lex is not limited to source which can be 
interpreted on the basis of one character look- 
ahead. For example, if there are two rules, one 
looking for ab and another for abcdefg, and the 
input stream is abcdefh, Lex will recognize ab and
leave the input pointer just before "cd...". Such
backup is more costly than the processing of 
simpler languages. 

2. Lex Source. 

The general format of Lex source is:

        {definitions}
        %%
        {rules}
        %%
        {user subroutines}

where the definitions and the user subroutines are
often omitted. The second %% is optional, but
the first is required to mark the beginning of the
rules. The absolute minimum Lex program is thus

        %%

(no definitions, no rules) which translates into a
program which copies the input to the output 
unchanged. 

In the outline of Lex programs shown 
above, the rules represent the user's control deci- 
sions; they are a table, in which the left column 
contains regular expressions (see section 3) and 
the right column contains actions, program frag-
ments to be executed when the expressions are
recognized. Thus an individual rule might appear

        integer   printf("found keyword INT");

to look for the string integer in the input stream
and print the message "found keyword INT"
whenever it appears. In this example the host pro-
cedural language is C and the C library function
printf is used to print the string. The end of the
expression is indicated by the first blank or tab
character. If the action is merely a single C
expression, it can just be given on the right side of
the line; if it is compound, or takes more than a
line, it should be enclosed in braces. As a slightly
more useful example, suppose it is desired to
change a number of words from British to Ameri-
can spelling. Lex rules such as

        colour      printf("color");
        mechanise   printf("mechanize");
        petrol      printf("gas");

would be a start. These rules are not quite
enough, since the word petroleum would become
gaseum; a way of dealing with this will be
described later.

3. Lex Regular Expressions. 

The definitions of regular expressions are 
very similar to those in QED [5]. A regular 
expression specifies a set of strings to be matched. 
It contains text characters (which match the 
corresponding characters in the strings being com- 
pared) and operator characters (which specify 
repetitions, choices, and other features). The 
letters of the alphabet and the digits are always 
text characters; thus the regular expression 
integer 

matches the string integer wherever it appears and 
the expression 

a57D 

looks for the string a57D. 

Operators. The operator characters are

        " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an
escape should be used. The quotation mark
operator (") indicates that whatever is contained
between a pair of quotes is to be taken as text
characters. Thus

        xyz"++"

matches the string xyz++ when it appears. Note
that a part of a string may be quoted. It is harm-
less but unnecessary to quote an ordinary text
character; the expression

        "xyz++"

is the same as the one above. Thus by quoting
every non-alphanumeric character being used as a
text character, the user can avoid remembering the
list above of current operator characters, and is
safe should further extensions to Lex lengthen the
list.

An operator character may also be turned 
into a text character by preceding it with \ as in

        xyz\+\+

which is another, less readable, equivalent of the 
above expressions. Another use of the quoting 
mechanism is to get a blank into an expression; 
normally, as explained above, blanks or tabs end a 
rule. Any blank character not contained within [] 
(see below) must be quoted. Several normal C 
escapes with \ are recognized: \n is newline, \t is 
tab, and \b is backspace. To enter \ itself, use \\. 
Since newline is illegal in an expression, \n must 
be used; it is not required to escape tab and back- 
space. Every character but blank, tab, newline and 
the list above is always a text character. 

Character classes. Classes of characters can 
be specified using the operator pair []. The con- 
struction [abc] matches a single character, which 
may be a, b, or c. Within square brackets, most 
operator meanings are ignored. Only three charac-
ters are special: these are \ - and ^. The -
character indicates ranges. For example,

        [a-z0-9<>_]

indicates the character class containing all the
lower case letters, the digits, the angle brackets,
and underline. Ranges may be given in either
order. Using - between any pair of characters
which are not both upper case letters, both lower
case letters, or both digits is implementation
dependent and will get a warning message. (E.g.,
[0-z] in ASCII is many more characters than it is
in EBCDIC). If it is desired to include the charac-
ter - in a character class, it should be first or last;
thus

        [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must
appear as the first character after the left bracket;
it indicates that the resulting string is to be com-
plemented with respect to the computer character
set. Thus

        [^abc]

matches all characters except a, b, or c, including
all special or control characters; or

        [^a-zA-Z]

is any character which is not a letter. The \ char- 
acter provides the usual escapes within character 
class brackets. 

Arbitrary character. To match almost any
character, the operator character

        .

is the class of all characters except newline.
Escaping into octal is possible although non-
portable:

        [\40-\176]

matches all printable characters in the ASCII char- 
acter set, from octal 40 (blank) to octal 176 (tilde). 

Optional expressions. The operator ? indi- 
cates an optional element of an expression. Thus 
ab?c 

matches either ac or abc. 

Repeated expressions. Repetitions of classes
are indicated by the operators * and +.

        a*

is any number of consecutive a characters, includ-
ing zero; while

        a+

is one or more instances of a. For example,

        [a-z]+

is all strings of lower case letters. And

        [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading 
alphabetic character. This is a typical expression 
for recognizing identifiers in computer languages. 
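
To make the meaning of the identifier expression concrete, here is a minimal hand-coded matcher for the same pattern. It is a sketch, not what Lex generates; the function name match_identifier is hypothetical, and the code assumes the default C locale, in which isalpha and isalnum accept exactly the ASCII letters and digits.

```c
#include <ctype.h>

/* Return the length of the longest prefix of s that matches
   [A-Za-z][A-Za-z0-9]*, or 0 if s does not begin with a letter. */
int match_identifier(const char *s)
{
    int n;

    if (!isalpha((unsigned char)s[0]))
        return 0;                       /* leading character must be a letter */
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;                            /* the * part: zero or more letters or digits */
    return n;
}
```

Applied to the text x25+y the routine reports a match of three characters, stopping at the + sign.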

Alternation and Grouping. The operator |
indicates alternation:

        (ab|cd)

matches either ab or cd. Note that parentheses are
used for grouping, although they are not necessary
on the outside level;

        ab|cd

would have sufficed. Parentheses can be used for
more complex expressions:

        (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or
cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a 
small amount of surrounding context. The two 
simplest operators for this are ^ and $. If the first
character of an expression is ^, the expression will
only be matched at the beginning of a line (after a
newline character, or at the beginning of the input
stream). This can never conflict with the other
meaning of ^, complementation of character
classes, since that only applies within the [] opera-
tors. If the very last character is $, the expression
will only be matched at the end of a line (when
immediately followed by newline). The latter
operator is a special case of the / operator charac-
ter, which indicates trailing context. The expres-
sion

        ab/cd

matches the string ab, but only if followed by cd.
Thus

        ab$

is the same as

        ab/\n

Left context is handled in Lex by start conditions 
as explained in section 10. If a rule is only to be 
executed when the Lex automaton interpreter is in 
start condition x, the rule should be prefixed by 
<x> 

using the angle bracket operator characters. If we
considered "being at the beginning of a line" to be
start condition ONE, then the ^ operator would
be equivalent to

        <ONE>

Start conditions are explained more fully later. 

Repetitions and Definitions. The operators
{} specify either repetitions (if they enclose
numbers) or definition expansion (if they enclose a
name). For example

        {digit}

looks for a predefined string named digit and
inserts it at that point in the expression. The
definitions are given in the first part of the Lex
input, before the rules. In contrast,

        a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the 
separator for Lex source segments. 

4. Lex Actions. 

When an expression written as above is 
matched, Lex executes the corresponding action. 
This section describes some features of Lex which 
aid in writing actions. Note that there is a default 
action, which consists of copying the input to the 
output. This is performed on all strings not other- 
wise matched. Thus the Lex user who wishes to 
absorb the entire input, without producing any 
output, must provide rules to match everything. 
When Lex is being used with Yacc, this is the nor-
mal situation. One may consider that actions are
what is done instead of copying the input to the 
output; thus, in general, a rule which merely 
copies can be omitted. Also, a character combina- 
tion which is omitted from the rules and which 
appears as input is likely to be printed on the out- 
put, thus calling attention to the gap in the rules. 

One of the simplest things that can be done 
is to ignore the input. Specifying a C null state- 
ment, ; as an action causes this result. A frequent 
rule is 

[ \t\n] ; 

which causes the three spacing characters (blank, 
tab, and newline) to be ignored. 

Another easy way to avoid writing actions is 
the action character |, which indicates that the 
action for this rule is the action for the next rule. 
The previous example could also have been writ-
ten

        " "     |
        "\t"    |
        "\n"    ;

with the same result, although in different style. 
The quotes around \n and \t are not required. 

In more complex actions, the user will often 
want to know the actual text that matched some 
expression like [a-z]+. Lex leaves this text in
an external character array named yytext. Thus,
to print the name found, a rule like

        [a-z]+   printf("%s", yytext);

will print the string in yytext. The C function
printf accepts a format argument and data to be
printed; in this case, the format is "print string" (%
indicating data conversion, and s indicating string
type), and the data are the characters in yytext.
So this just places the matched string on the out-
put. This action is so common that it may be writ-
ten as ECHO:

        [a-z]+   ECHO;

is the same as the above. Since the default action
is just to print the characters found, one might ask
why give a rule, like this one, which merely
specifies the default action? Such rules are often
required to avoid matching some other rule which
is not desired. For example, if there is a rule
which matches read it will normally match the
instances of read contained in bread or readjust;
to avoid this, a rule of the form [a-z]+ is
needed. This is explained further below.

Sometimes it is more convenient to know 
the end of what has been found; hence Lex also 
provides a count yyleng of the number of charac- 
ters matched. To count both the number of words 
and the number of characters in words in the 
input, the user might write

        [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number of charac-
ters in the words recognized. The last character in
the string matched can be accessed by

        yytext[yyleng-1]
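
The bookkeeping done by the word-counting rule can be sketched in plain C as follows. This is an illustration only: the function name count_words is hypothetical, the local variable yyleng merely imitates the Lex variable of the same name, and characters that are not letters are simply skipped rather than copied to the output.

```c
#include <ctype.h>

/* Count maximal runs of letters in s, as the rule [a-zA-Z]+ would,
   accumulating the number of words and the number of letters in them. */
void count_words(const char *s, int *words, int *chars)
{
    *words = 0;
    *chars = 0;
    while (*s != '\0') {
        if (isalpha((unsigned char)*s)) {
            int yyleng = 0;             /* length of this match */
            while (isalpha((unsigned char)s[yyleng]))
                yyleng++;
            (*words)++;
            *chars += yyleng;
            s += yyleng;                /* move past the matched text */
        } else {
            s++;                        /* not part of a word */
        }
    }
}
```

On the input "the quick, brown fox" this gives four words and sixteen characters.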

Occasionally, a Lex action may decide that a 
rule has not recognized the correct span of charac- 
ters. Two routines are provided to aid with this 
situation. First, yymore() can be called to indicate
that the next input expression recognized is to be
tacked on to the end of this input. Normally, the
next input string would overwrite the current entry
in yytext. Second, yyless(n) may be called to
indicate that not all the characters matched by the 
currently successful expression are wanted right 
now. The argument n indicates the number of 
characters in yytext to be retained. Further char- 
acters previously matched are returned to the 
input. This provides the same sort of lookahead 
offered by the / operator, but in a different form. 

Example: Consider a language which 
defines a string as a set of characters between quo- 
tation (") marks, and provides that to include a " 
in a string it must be preceded by a \. The regular 
expression which matches that is somewhat confus-
ing, so that it might be preferable to write

        \"[^"]*   {
                  if (yytext[yyleng-1] == '\\')
                      yymore();
                  else
                      ... normal user processing
                  }

which will, when faced with a string such as
"abc\"def" first match the five characters "abc\ ;
then the call to yymore() will cause the next part
of the string, "def , to be tacked on the end. Note
that the final quote terminating the string should
be picked up in the code labeled "normal process-
ing".
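
The repeated extension performed by yymore() in the string example can be imitated by a loop. The sketch below is a hand-written stand-in, not Lex output; the function name scan_string is hypothetical, and the local variable yyleng only mimics the growing match length.

```c
/* Given s pointing at an opening quote, return the number of characters
   up to but not including the closing quote, or -1 if s does not start
   with a quote or the string is never closed. */
int scan_string(const char *s)
{
    int yyleng = 0;

    if (s[0] != '"')
        return -1;
    for (;;) {
        /* one match of \"[^"]* : a quote, then everything up to the next quote */
        yyleng++;
        while (s[yyleng] != '\0' && s[yyleng] != '"')
            yyleng++;
        if (s[yyleng] == '\0')
            return -1;                  /* ran off the end of the input */
        if (s[yyleng - 1] != '\\')
            return yyleng;              /* the next character is the closing quote */
        /* the quote was escaped: keep extending, as yymore() would */
    }
}
```

For the input "abc\"def" the first pass stops after five characters, exactly as described above, and the second pass extends the match to nine. Like the rule it imitates, the sketch would misjudge a string whose last character is itself an escaped backslash.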

The function yyless() might be used to
reprocess text in various circumstances. Consider
the C problem of distinguishing the ambiguity of
"=-a". Suppose it is desired to treat this as
"=- a" but print a message. A rule might be

        =-[a-zA-Z]   {
                     printf("Operator (=-) ambiguous\n");
                     yyless(yyleng-1);
                     ... action for =- ...
                     }

which prints a message, returns the letter after the
operator to the input stream, and treats the opera-
tor as "=-". Alternatively it might be desired to
treat this as "= -a". To do this, just return the
minus sign as well as the letter to the input:

        =-[a-zA-Z]   {
                     printf("Operator (=-) ambiguous\n");
                     yyless(yyleng-2);
                     ... action for = ...
                     }

will perform the other interpretation. Note that
the expressions for the two cases might more easily
be written

        =-/[A-Za-z]

in the first case and

        =/-[A-Za-z]

in the second; no backup would be required in the
rule action. It is not necessary to recognize the
whole identifier to observe the ambiguity. The
possibility of "=-3", however, makes

        =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also per-
mits access to the I/O routines it uses. They are:

1) input() which returns the next input charac-
ter;

2) output(c) which writes the character c on
the output; and

3) unput(c) pushes the character c back onto
the input stream to be read later by input().

By default these routines are provided as macro 
definitions, but the user can override them and 
supply private versions. These routines define the 
relationship between external files and internal 
characters, and must all be retained or modified 
consistently. They may be redefined, to cause 
input or output to be transmitted to or from 
strange places, including other programs or inter- 
nal memory; but the character set used must be 
consistent in all routines; a value of zero returned 
by input must mean end of file; and the relation-
ship between unput and input must be retained or
the Lex lookahead will not work. Lex does not
look ahead at all if it does not have to, but every
rule ending in + * ? or $ or containing / implies
lookahead. Lookahead is also necessary to match 
an expression that is a prefix of another expression. 
See below for a discussion of the character set 
used by Lex. The standard Lex library imposes a 
100 character limit on backup. 
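
The contract between input and unput can be illustrated with a small pushback stack reading from a string in memory. This is a sketch under stated assumptions, not the library's macros: the names my_input, my_unput, and set_source are hypothetical, and the limit of 100 simply mirrors the figure quoted above.

```c
#define PUSHBACK_MAX 100                /* mirrors the 100 character limit */

static char pushback[PUSHBACK_MAX];     /* characters returned to the input */
static int npushed = 0;
static const char *src = "";            /* hypothetical in-memory input */

/* Direct the routines at a new in-memory string. */
void set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Next character: pushed-back characters first (most recent first),
   then the source. Returns 0 at end of input, as the text requires. */
int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back onto the input; returns -1 if the limit would be exceeded. */
int my_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

The point of the sketch is the pairing: whatever is handed to my_unput must come back from my_input before any fresh input does, which is the relationship that lookahead depends on.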

Another Lex library routine that the user
will sometimes want to redefine is yywrap() which
is called whenever Lex reaches an end-of-file. If
yywrap returns a 1, Lex continues with the normal
wrapup on end of input. Sometimes, however, it
is convenient to arrange for more input to arrive
from a new source. In this case, the user should
provide a yywrap which arranges for new input
and returns 0. This instructs Lex to continue pro-
cessing. The default yywrap always returns 1.

This routine is also a convenient place to 
print tables, summaries, etc. at the end of a pro- 
gram. Note that it is not possible to write a nor- 
mal rule which recognizes end-of-file; the only 
access to this condition is through yywrap. In fact, 
unless a private version of input() is supplied a file
containing nulls cannot be handled, since a value 
of 0 returned by input is taken to be end-of-file. 

5. Ambiguous Source Rules. 

Lex can handle ambiguous specifications. 
When more than one expression can match the
current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same
number of characters, the rule given first is
preferred.

Thus, suppose the rules

        integer   keyword action
        [a-z]+    identifier action

to be given in that order. If the input is integers,
it is taken as an identifier, because [a-z]+
matches 8 characters while integer matches only 7. 
If the input is integer, both rules match 7 charac- 
ters, and the keyword rule is selected because it 
was given first. Anything shorter (e.g. int) will not 
match the expression integer and so the identifier 
interpretation is used. 
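
The two disambiguation principles can be spelled out for this pair of rules in a few lines of C. The sketch below is hand-written for illustration; the function name classify and the constants NONE, KEYWORD, and IDENTIFIER are hypothetical, and a real Lex analyzer reaches the same decision through its automaton rather than by testing each rule in turn.

```c
#include <ctype.h>
#include <string.h>

enum { NONE, KEYWORD, IDENTIFIER };

/* Classify the token at the start of s under the two rules
       integer   (rule 1)
       [a-z]+    (rule 2)
   and store the length of the chosen match in *len. */
int classify(const char *s, int *len)
{
    int kw = (strncmp(s, "integer", 7) == 0) ? 7 : 0;   /* rule 1 match length */
    int id = 0;                                         /* rule 2 match length */

    while (islower((unsigned char)s[id]))
        id++;

    if (kw == 0 && id == 0) {
        *len = 0;
        return NONE;
    }
    if (kw >= id) {                     /* on a tie the earlier rule wins */
        *len = kw;
        return KEYWORD;
    }
    *len = id;                          /* otherwise the longer match wins */
    return IDENTIFIER;
}
```

With this routine integers is classified as an identifier of length 8, integer as the keyword, and int as an identifier of length 3, in agreement with the discussion above.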

The principle of preferring the longest 
match makes rules containing expressions like .* 
dangerous. For example, 
'•*' 

might seem a good way of recognizing a string in 
single quotes. But it is an invitation for the pro- 
gram to read far ahead, looking for a distant single 
quote. Presented with the input 

'first' quoted string here, 'second' here 

the above expression will match 

'first' quoted string here, 'second' 

which is probably not what was wanted. A better 
rule is of the form 

'[^'\n]*' 

which, on the above input, will stop after 'first'. 
The consequences of errors like this are mitigated 
by the fact that the . operator will not match newline. 
Thus expressions like .* stop on the current 
line. Don't try to defeat this with expressions like 
(.|\n)+ or equivalents; the Lex generated program 
will try to read the entire input file, causing internal 
buffer overflows. 
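
The difference between the two quoted-string expressions can be imitated by hand in C. The sketch below is not generated by Lex; match_greedy and match_bounded are invented names, and each returns the number of characters the corresponding expression would match at the start of its argument.

```c
#include <string.h>

/* Length matched by the expression that uses .* between the quotes:
   it runs to the LAST quote on the line, since the longest match is
   preferred. Returns 0 if there is no match. */
size_t match_greedy(const char *s)
{
    size_t eol, i;

    if (s[0] != '\'')
        return 0;
    eol = strcspn(s, "\n");           /* . never matches newline */
    for (i = eol; i > 1; i--)         /* look backward for a closing quote */
        if (s[i - 1] == '\'')
            return i;
    return 0;
}

/* Length matched by the expression that excludes quote and newline
   between the quotes: it stops at the FIRST following quote.
   Returns 0 if there is no match. */
size_t match_bounded(const char *s)
{
    size_t n;

    if (s[0] != '\'')
        return 0;
    n = strcspn(s + 1, "'\n");        /* characters other than ' and newline */
    if (s[1 + n] != '\'')
        return 0;
    return n + 2;                     /* both quotes plus the contents */
}
```

On the sample input, match_greedy covers everything through 'second', while match_bounded covers only 'first'.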

Note that Lex is normally partitioning the 
input stream, not searching for all possible matches 
of each expression. This means that each charac- 
ter is accounted for once and only once. For 
example, suppose it is desired to count occurrences 
of both she and he in an input text. Some Lex 
rules to do this might be 

she   s++; 
he    h++; 
\n    | 
.     ; 

where the last two rules ignore everything besides 
he and she. Remember that . does not include 
newline. Since she includes he, Lex will normally 
not recognize the instances of he included in she, 
since once it has passed a she those characters are 
gone. 

Sometimes the user would like to override 
this choice. The action REJECT means "go do 
the next alternative." It causes whatever rule was 
second choice after the current rule to be exe- 
cuted. The position of the input pointer is 
adjusted accordingly. Suppose the user really 
wants to count the included instances of he: 

she   {s++; REJECT;} 
he    {h++; REJECT;} 
\n    | 
.     ; 

these rules are one way of changing the previous 
example to do just that. After counting each 
expression, it is rejected; whenever appropriate, 
the other expression will then be counted. In this 
example, of course, the user could note that she 
includes he but not vice versa, and omit the 
REJECT action on he; in other cases, however, it 
would not be possible a priori to tell which input 
characters were in both classes. 

Consider the two rules 

a[bc]+   { ... ; REJECT;} 
a[cd]+   { ... ; REJECT;} 

If the input is ab, only the first rule matches, and 
on ad only the second matches. The input string 
accb matches the first rule for four characters and 
then the second rule for three characters. In con- 
trast, the input accd agrees with the second rule 
for four characters and then the first rule for three. 

In general, REJECT is useful whenever the 
purpose of Lex is not to partition the input stream 
but to detect all examples of some items in the 
input, and the instances of these items may overlap 
or include each other. Suppose a digram table of 
the input is desired; normally the digrams overlap, 
that is the word the is considered to contain both 
th and he. Assuming a two-dimensional array 
named digram to be incremented, the appropriate 
source is 

%% 
[a-z][a-z]   {digram[yytext[0]][yytext[1]]++; REJECT;} 
.            ; 
\n           ; 

where the REJECT is necessary to pick up a letter 
pair beginning at every character, rather than at 
every other character. 
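
The effect of the REJECT in the digram rule can be imitated in plain C by starting a new pair at every character. This is a sketch of the result rather than of the generated scanner; unlike the Lex source above, the array here is indexed by letter offset (0 through 25) instead of by raw character value, and count_digrams is an invented name.

```c
#include <stddef.h>

/* Counts indexed by (first letter, second letter), each 0..25. */
int digram[26][26];

static int is_lower(char c)
{
    return c >= 'a' && c <= 'z';
}

/* Count every pair of adjacent lower-case letters in text, with a
   pair beginning at every character, so that overlapping pairs
   such as th and he in the word the are both counted. */
void count_digrams(const char *text)
{
    size_t i;

    for (i = 0; text[i] != '\0' && text[i + 1] != '\0'; i++)
        if (is_lower(text[i]) && is_lower(text[i + 1]))
            digram[text[i] - 'a'][text[i + 1] - 'a']++;
}
```

Without the overlap, the word the would contribute only th; with it, both th and he are counted.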

6. Lex Source Definitions. 

Remember the format of the Lex source: 

{definitions} 
%% 
{rules} 
%% 
{user routines} 

So far only the rules have been described. The 
user needs additional options, though, to define 
variables for use in his program and for use by 
Lex. These can go either in the definitions section 
or in the rules section. 

Remember that Lex is turning the rules into 
a program. Any source not intercepted by Lex is 
copied into the generated program. There are 
three classes of such things. 

1) Any line which is not part of a Lex rule or 
action which begins with a blank or tab is 
copied into the Lex generated program. 
Such source input prior to the first %% delimiter 
will be external to any function in the 
code; if it appears immediately after the first 
%%, it appears in an appropriate place for 
declarations in the function written by Lex 
which contains the actions. This material 
must look like program fragments, and 
should precede the first Lex rule. 

As a side effect of the above, lines which 
begin with a blank or tab, and which contain 
a comment, are passed through to the 
generated program. This can be used to 
include comments in either the Lex source 
or the generated code. The comments 
should follow the host language convention. 

2) Anything included between lines containing 
only %{ and %} is copied out as above. 
The delimiters are discarded. This format 
permits entering text like preprocessor state- 
ments that must begin in column 1, or copy- 
ing lines that do not look like programs. 

3) Anything after the third %% delimiter, 
regardless of formats, etc., is copied out after 
the Lex output. 

Definitions intended for Lex are given 
before the first %% delimiter. Any line in this section 
not contained between %{ and %}, and beginning 
in column 1, is assumed to define Lex substitution 
strings. The format of such lines is 

name translation 

and it causes the string given as a translation to be 
associated with the name. The name and translation 
must be separated by at least one blank or 
tab, and the name must begin with a letter. The 
translation can then be called out by the {name} 
syntax in a rule. Using {D} for the digits and {E} 
for an exponent field, for example, might abbreviate 
rules to recognize numbers: 

D                   [0-9] 
E                   [DEde][-+]?{D}+ 
%% 
{D}+                printf("integer"); 
{D}+"."{D}*({E})?   | 
{D}*"."{D}+({E})?   | 
{D}+{E}             printf("real"); 

Note the first two rules for real numbers; both 
require a decimal point and contain an optional 
exponent field, but the first requires at least one 
digit before the decimal point and the second 
requires at least one digit after the decimal point. 
To correctly handle the problem posed by a Fortran 
expression such as 35.EQ.I, which does not 
contain a real number, a context-sensitive rule 
such as 

[0-9]+/"."EQ        printf("integer"); 

could be used in addition to the normal rule for 
integers. 

The definitions section may also contain 
other commands, including the selection of a host 
language, a character set table, a list of start conditions, 
or adjustments to the default size of arrays 
within Lex itself for larger source programs. These 
possibilities are discussed below under "Summary 
of Source Format," section 12. 

7. Usage. 

There are two steps in compiling a Lex 
source program. First, the Lex source must be 
turned into a generated program in the host gen- 
eral purpose language. Then this program must be 
compiled and loaded, usually with a library of Lex 
subroutines. The generated program is on a file 
named lex.yy.c. The I/O library is defined in 
terms of the C standard library [6]. 

The C programs generated by Lex are 
slightly different on OS/370, because the OS com- 
piler is less powerful than the UNIX or GCOS 
compilers, and does less at compile time. C pro- 
grams generated on GCOS and UNIX are the 
same. 

UNIX. The library is accessed by the 
loader flag -ll. So an appropriate set of commands 
is 

lex source 
cc lex.yy.c -ll 

The resulting program is placed on the usual file 
a.out for later execution. To use Lex with Yacc 
see below. Although the default Lex I/O routines 
use the C standard library, the Lex automata 
themselves do not do so; if private versions of 
input, output and unput are given, the library can 
be avoided. 



8. Lex and Yacc. 

If you want to use Lex with Yacc, note that 
what Lex writes is a program named yylex(), the 
name required by Yacc for its analyzer. Normally, 
the default main program on the Lex library calls 
this routine, but if Yacc is loaded, and its main 
program is used, Yacc will call yylex(). In this 
case each Lex rule should end with 

return(token); 

where the appropriate token value is returned. An 
easy way to get access to Yacc's names for tokens 
is to compile the Lex output file as part of the 
Yacc output file by placing the line 

# include "lex.yy.c" 

in the last section of Yacc input. Supposing the 
grammar to be named "good" and the lexical rules 
to be named "better" the UNIX command 
sequence can just be: 

yacc good 
lex better 
cc y.tab.c -ly -ll 

The Yacc library (-ly) should be loaded before 
the Lex library, to obtain a main program which 
invokes the Yacc parser. The generations of Lex 
and Yacc programs can be done in either order. 

9. Examples. 

As a trivial problem, consider copying an 
input file while adding 3 to every positive number 
divisible by 7. Here is a suitable Lex source program 

%% 
        int k; 
[0-9]+  { 
        k = atoi(yytext); 
        if (k%7 == 0) 
                printf("%d", k+3); 
        else 
                printf("%d", k); 
        } 


to do just that. The rule [0-9]+ recognizes 
strings of digits; atoi converts the digits to binary 
and stores the result in k. The operator % 
(remainder) is used to check whether k is divisible 
by 7; if it is, it is incremented by 3 as it is written 
out. It may be objected that this program will 
alter such input items as 49.63 or X7. Furthermore, 
it increments the absolute value of all negative 
numbers divisible by 7. To avoid this, just 
add a few more rules after the active one, as here: 

%% 
                        int k; 
-?[0-9]+                { 
                        k = atoi(yytext); 
                        printf("%d", k%7 == 0 ? k+3 : k); 
                        } 
-?[0-9.]+               ECHO; 
[A-Za-z][A-Za-z0-9]+    ECHO; 

Numerical strings containing a "." or preceded by 
a letter will be picked up by one of the last two 
rules, and not changed. The if-else has been 
replaced by a C conditional expression to save 
space; the form a?b:c means "if a then b else c". 

For an example of statistics gathering, here 
is a program which histograms the lengths of 
words, where a word is defined as a string of 
letters. 

        int lengs[100]; 
%% 
[a-z]+  lengs[yyleng]++; 
.       | 
\n      ; 
%% 
yywrap() 
{ 
        int i; 
        printf("Length  No. words\n"); 
        for(i=0; i<100; i++) 
                if (lengs[i] > 0) 
                        printf("%5d%10d\n",i,lengs[i]); 
        return(1); 
} 

This program accumulates the histogram, while 
producing no output. At the end of the input it 
prints the table. The final statement return(1); 
indicates that Lex is to perform wrapup. If yywrap 
returns zero (false) it implies that further input is 
available and the program is to continue reading 
and processing. To provide a yywrap that never 
returns true causes an infinite loop. 

As a larger example, here are some parts of 
a program written by N. L. Schryer to convert 
double precision Fortran to single precision For- 
tran. Because Fortran does not distinguish upper 
and lower case letters, this routine begins by 
defining a set of classes including both cases of 
each letter: 

a       [aA] 
b       [bB] 
c       [cC] 
... 
z       [zZ] 

An additional class recognizes white space: 

W       [ \t]* 

The first rule changes "double precision" to "real", 
or "DOUBLE PRECISION" to "REAL". 

{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}   { 
        printf(yytext[0]=='d'? "real" : "REAL"); 
        } 

Care is taken throughout this program to preserve 
the case (upper or lower) of the original program. 
The conditional operator is used to select the 
proper form of the keyword. The next rule copies 
continuation card indications to avoid confusing 
them with constants: 

^"     "[^ 0]   ECHO; 

In the regular expression, the quotes surround the 
blanks. It is interpreted as "beginning of line, then 
five blanks, then anything but blank or zero." 
Note the two different meanings of ^. There follow 
some rules to change double precision constants 
to ordinary floating constants. 

[0-9]+{W}{d}{W}[+-]?{W}[0-9]+           | 
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     | 
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     { 
        /* convert constants */ 
        for(p=yytext; *p != 0; p++) 
                if (*p == 'd' || *p == 'D') 
                        *p += 'e' - 'd'; 
        ECHO; 
        } 

After the floating point constant is recognized, it is 
scanned by the for loop to find the letter d or D. 
The program then adds 'e'-'d', which converts it 
to the next letter of the alphabet. The modified 
constant, now single-precision, is written out again. 
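
The character arithmetic in this action can be tried outside Lex. The function below is a stand-alone sketch of the same loop, with the invented name single_exponent; it relies, as the action does, on d and e being adjacent in the character set.

```c
/* Turn the exponent letter of a double precision constant into the
   single precision one: d becomes e and D becomes E. The text is
   modified in place. */
void single_exponent(char *text)
{
    char *p;

    for (p = text; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';     /* move to the next letter of the alphabet */
}
```

Applied to the text 1.5d0 this yields 1.5e0, and 3D-7 becomes 3E-7; the case of the letter is preserved because the same offset is added to either form.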
There follow a series of names which must be 
respelled to remove their initial d. By using the 
array yytext the same action suffices for all the 
names (only a sample of a rather long list is given 
here). 

{d}{s}{i}{n}         | 
{d}{c}{o}{s}         | 
{d}{s}{q}{r}{t}      | 
{d}{a}{t}{a}{n}      | 
... 
{d}{f}{l}{o}{a}{t}   printf("%s",yytext+1); 

Another list of names must have initial d changed 
to initial a: 

{d}{l}{o}{g}     | 
{d}{l}{o}{g}10   | 
{d}{m}{i}{n}1    | 
{d}{m}{a}{x}1    { 
        yytext[0] += 'a' - 'd'; 
        ECHO; 
        } 

And one routine must have initial d changed to 
initial r: 

{d}1{m}{a}{c}{h}   { 
        yytext[0] += 'r' - 'd'; 
        ECHO; 
        } 

To avoid such names as dsinx being detected as 
instances of dsin, some final rules pick up longer 
words as identifiers and copy some surviving characters: 

[A-Za-z][A-Za-z0-9]*   | 
[0-9]+                 | 
\n                     | 
.                      ECHO; 

Note that this program is not complete; it does not 
deal with the spacing problems in Fortran or with 
the use of keywords as identifiers. 

10. Left Context Sensitivity. 

Sometimes it is desirable to have several sets 
of lexical rules to be applied at different times in 
the input. For example, a compiler preprocessor 
might distinguish preprocessor statements and 
analyze them differently from ordinary statements. 
This requires sensitivity to prior context, and there 
are several ways of handling such problems. The 
^ operator, for example, is a prior context operator, 
recognizing immediately preceding left context 
just as $ recognizes immediately following right 
context. Adjacent left context could be extended, 
to produce a facility similar to that for adjacent 
right context, but it is unlikely to be as useful, 
since often the relevant left context appeared some 
time earlier, such as at the beginning of a line. 

This section describes three means of deal- 
ing with different environments: a simple use of 
flags, when only a few rules change from one 
environment to another, the use of start conditions 
on rules, and the possibility of making multiple 
lexical analyzers all run together. In each case, 
there are rules which recognize the need to change 
the environment in which the following input text 
is analyzed, and set some parameter to reflect the 
change. This may be a flag explicitly tested by the 
user's action code; such a flag is the simplest way 
of dealing with the problem, since Lex is not 
involved at all. It may be more convenient, however, 
to have Lex remember the flags as initial 
conditions on the rules. Any rule may be associated 
with a start condition. It will only be recognized 
when Lex is in that start condition. The 
current start condition may be changed at any 
time. Finally, if the sets of rules for the different 
environments are very dissimilar, clarity may be 
best achieved by writing several distinct lexical 
analyzers, and switching from one to another as 
desired. 

Consider the following problem: copy the 
input to the output, changing the word magic to 
first on every line which began with the letter a, 
changing magic to second on every line which 
began with the letter b, and changing magic to 
third on every line which began with the letter c. 
All other words and all other lines are left 
unchanged. 

These rules are so simple that the easiest 
way to do this job is with a flag: 

        int flag; 
%% 
^a      {flag = 'a'; ECHO;} 
^b      {flag = 'b'; ECHO;} 
^c      {flag = 'c'; ECHO;} 
\n      {flag =  0 ; ECHO;} 
magic   { 
        switch (flag) 
        { 
        case 'a': printf("first"); break; 
        case 'b': printf("second"); break; 
        case 'c': printf("third"); break; 
        default: ECHO; break; 
        } 
        } 

should be adequate. 

To handle the same problem with start conditions, 
each start condition must be introduced to 
Lex in the definitions section with a line reading 

%Start name1 name2 ... 

where the conditions may be named in any order. 
The word Start may be abbreviated to s or S. 
The conditions may be referenced at the head of a 
rule with the <> brackets: 

<name1>expression 

is a rule which is only recognized when Lex is in 
the start condition name1. To enter a start condition, 
execute the action statement 

BEGIN name1; 

which changes the start condition to name1. To 
resume the normal state, 

BEGIN 0; 

resets the initial condition of the Lex automaton 
interpreter. A rule may be active in several start 
conditions: 

<name1,name2,name3> 

is a legal prefix. Any rule not beginning with the 
<> prefix operator is always active. 

The same example as before can be written: 

%START AA BB CC 
%% 
^a          {ECHO; BEGIN AA;} 
^b          {ECHO; BEGIN BB;} 
^c          {ECHO; BEGIN CC;} 
\n          {ECHO; BEGIN 0;} 
<AA>magic   printf("first"); 
<BB>magic   printf("second"); 
<CC>magic   printf("third"); 

where the logic is exactly the same as in the previ- 
ous method of handling the problem, but Lex does 
the work rather than the user's code. 
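
To see what the start conditions accomplish, the same job can be written by hand with a single state variable. The sketch below is not what Lex generates; it is a plain C imitation under the assumption that input and output are ordinary strings, and the names translate, start and INITIAL are invented for the illustration.

```c
#include <string.h>

/* INITIAL plays the role of the normal state entered by BEGIN 0. */
enum start { INITIAL, AA, BB, CC };

/* Copy in to out, replacing the word magic according to the state
   selected by the first letter of the current line. The caller must
   make out large enough to hold the result. */
void translate(const char *in, char *out)
{
    enum start state = INITIAL;
    int at_line_start = 1;

    *out = '\0';
    while (*in != '\0') {
        if (at_line_start && *in >= 'a' && *in <= 'c')
            state = (enum start)(AA + (*in - 'a'));   /* BEGIN AA, BB or CC */
        at_line_start = 0;

        if (state != INITIAL && strncmp(in, "magic", 5) == 0) {
            strcat(out, state == AA ? "first" :
                        state == BB ? "second" : "third");
            in += 5;
            continue;
        }
        if (*in == '\n') {
            state = INITIAL;                          /* BEGIN 0 */
            at_line_start = 1;
        }
        strncat(out, in, 1);                          /* ECHO one character */
        in++;
    }
}
```

The variable state here does the bookkeeping that Lex performs internally when rules carry the <AA>, <BB> and <CC> prefixes.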

11. Character Set. 

The programs generated by Lex handle 
character I/O only through the routines input, out- 
put, and unput. Thus the character representation 
provided in these routines is accepted by Lex and 
employed to return values in yytext. For internal 
use a character is represented as a small integer 
which, if the standard library is used, has a value 
equal to the integer value of the bit pattern 
representing the character on the host computer. 
Normally, the letter a is represented as the same 
form as the character constant 'a'. If this interpretation 
is changed, by providing I/O routines which 
translate the characters, Lex must be told about it, 
by giving a translation table. This table must be in 
the definitions section, and must be bracketed by 
lines containing only "%T". The table contains 
lines of the form 

{integer} {character string} 

which indicate the value associated with each character. 
Thus the next example 

%T 
 1      Aa 
 2      Bb 
... 
26      Zz 
27      \n 
28      + 
29      - 
30      0 
31      1 
... 
39      9 
%T 

Sample character table. 

maps the lower and upper case letters together into 
the integers 1 through 26, newline into 27, + and 
- into 28 and 29, and the digits into 30 through 
39. Note the escape for newline. If a table is supplied, 
every character that is to appear either in 
the rules or in any valid input must be included in 
the table. No character may be assigned the 
number 0, and no character may be assigned a 
bigger number than the size of the hardware character 
set. 
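
The mapping described by the sample table can be written out as a C array. This is only a sketch of the correspondence, not of how Lex stores the table: it assumes an 8-bit host character set in which the letters and the digits are each contiguous, and the names ctable, build_table and internal_code are invented here.

```c
/* Translation table indexed by host character code. A value of 0
   means the character was not given in the table. */
int ctable[256];

/* Fill in the sample table: letters of either case to 1..26,
   newline to 27, + and - to 28 and 29, digits to 30..39. */
void build_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        ctable[i] = 0;
    for (i = 0; i < 26; i++) {
        ctable['A' + i] = i + 1;
        ctable['a' + i] = i + 1;
    }
    ctable['\n'] = 27;
    ctable['+'] = 28;
    ctable['-'] = 29;
    for (i = 0; i < 10; i++)
        ctable['0' + i] = 30 + i;
}

/* Internal code for c, or 0 if c does not appear in the table. */
int internal_code(char c)
{
    return ctable[(unsigned char)c];
}
```

Since no character may be assigned the number 0, that value is free to mark characters missing from the table.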

12. Summary of Source Format. 

The general form of a Lex source file is: 

{definitions} 
%% 
{rules} 
%% 
{user subroutines} 

The definitions section contains a combination of 

1) Definitions, in the form "name space translation". 

2) Included code, in the form "space code". 

3) Included code, in the form 

%{ 
code 
%} 

4) Start conditions, given in the form 

%S name1 name2 ... 

5) Character set tables, in the form 

%T 
number space character-string 
%T 

6) Changes to internal array sizes, in the form 

%x nnn 

where nnn is a decimal integer representing 
an array size and x selects the parameter as 
follows: 

Letter   Parameter 
p        positions 
n        states 
e        tree nodes 
a        transitions 
k        packed character classes 
o        output array size 

Lines in the rules section have the form "expression 
action" where the action may be continued 
on succeeding lines by using braces to delimit it. 

Regular expressions in Lex use the following 
operators: 

x        the character "x" 
"x"      an "x", even if x is an operator. 
\x       an "x", even if x is an operator. 
[xy]     the character x or y. 
[x-z]    the characters x, y or z. 
[^x]     any character but x. 
.        any character but newline. 
^x       an x at the beginning of a line. 
<y>x     an x when Lex is in start condition y. 
x$       an x at the end of a line. 
x?       an optional x. 
x*       0,1,2, ... instances of x. 
x+       1,2,3, ... instances of x. 
x|y      an x or a y. 
(x)      an x. 
x/y      an x but only if followed by y. 
{xx}     the translation of xx from the definitions section. 
x{m,n}   m through n occurrences of x. 

13. Caveats and Bugs. 

There are pathological expressions which 
produce exponential growth of the tables when 
converted to deterministic machines; fortunately, 
they are rare. 

REJECT does not rescan the input; instead 
it remembers the results of the previous scan. This 
means that if a rule with trailing context is found, 
and REJECT executed, the user must not have 
used unput to change the characters forthcoming 
from the input stream. This is the only restriction 
on the user's ability to manipulate the not-yet-processed 
input. 


SED - A Non-interactive Text Editor 

Lee E. McMahon 

Introduction 

Sed is a non-interactive context editor designed to be especially useful in three cases: 

1) To edit files too large for comfortable interactive editing; 

2) To edit any size file when the sequence of editing commands is too complicated to be 
comfortably typed in interactive mode; 

3) To perform multiple 'global' editing functions efficiently in one pass through the input. 

Since only a few lines of the input reside in core at one time, and no temporary files are used, the 
effective size of file that can be edited is limited only by the requirement that the input and output 
fit simultaneously into available secondary storage. 

Complicated editing scripts can be created separately and given to sed as a command file. For 
complex edits, this saves considerable typing, and its attendant errors. Sed running from a command 
file is much more efficient than any interactive editor known to the author, even if that editor 
can be driven by a pre-written script. 

The principal losses of function compared to an interactive editor are lack of relative addressing 
(because of the line-at-a-time operation), and lack of immediate verification that a command has 
done what was intended. 

Sed is a lineal descendant of the UNIX editor, ed. Because of the differences between interactive 
and non-interactive operation, considerable changes have been made between ed and sed; even 
confirmed users of ed will frequently be surprised (and probably chagrined), if they rashly use sed 
without reading Sections 2 and 3 of this document. The most striking family resemblance between 
the two editors is in the class of patterns ('regular expressions') they recognize; the code for matching 
patterns is copied almost verbatim from the code for ed, and the description of regular expressions 
in Section 2 is copied almost verbatim from the UNIX Programmer's Manual[1]. (Both code 
and description were written by Dennis M. Ritchie.) 

1. Overall Operation 

Sed by default copies the standard input to the standard output, perhaps performing one or more 
editing commands on each line before writing it to the output. This behavior may be modified by 
flags on the command line; see Section 1.1 below. 

The general format of an editing command is: 

[address1,address2][function][arguments] 

One or both addresses may be omitted; the format of addresses is given in Section 2. Any number 
of blanks or tabs may separate the addresses from the function. The function must be present; the 
available commands are discussed in Section 3. The arguments may be required or optional, 
according to which function is given; again, they are discussed in Section 3 under each individual 
function. 

Tab characters and spaces at the beginning of lines are ignored. 

1.1. Command-line Flags 

Three flags are recognized on the command line: 

-n: tells sed not to copy all lines, but only those specified by p functions or p flags after s 
functions (see Section 3.3); 

-e: tells sed to take the next argument as an editing command; 

-f: tells sed to take the next argument as a file name; the file should contain editing commands, 
one to a line. 

1.2. Order of Application of Editing Commands 

Before any editing is done (in fact, before any input file is even opened), all the editing commands 
are compiled into a form which will be moderately efficient during the execution phase (when the 
commands are actually applied to lines of the input file). The commands are compiled in the order 
in which they are encountered; this is generally the order in which they will be attempted at execu- 
tion time. The commands are applied one at a time; the input to each command is the output of 
all preceding commands. 

The default linear order of application of editing commands can be changed by the flow-of-control 
commands, t and b (see Section 3). Even when the order of application is changed by these commands, 
it is still true that the input line to any command is the output of any previously applied 
command. 

1.3. Pattern-space 

The range of pattern matches is called the pattern space. Ordinarily, the pattern space is one line 
of the input text, but more than one line can be read into the pattern space by using the N com- 
mand (Section 3.6.). 

1.4. Examples 

Examples are scattered throughout the text. Except where otherwise noted, the examples all 
assume the following input text: 

In Xanadu did Kubla Khan 
A stately pleasure dome decree: 
Where Alph, the sacred river, ran 
Through caverns measureless to man 
Down to a sunless sea. 

(In no case is the output of the sed commands to be considered an improvement on Coleridge.) 

Example: 
The command 
2q 

will quit after copying the first two lines of the input. The output will be: 

In Xanadu did Kubla Khan 
A stately pleasure dome decree: 

2. ADDRESSES: Selecting lines for editing 

Lines in the input file(s) to which editing commands are to be applied can be selected by addresses. 
Addresses may be either line numbers or context addresses. 

The application of a group of commands can be controlled by one address (or address-pair) by 
grouping the commands with curly braces ('{ }') (Sec. 3.6.). 

2.1. Line-number Addresses 

A line number is a decimal integer. As each line is read from the input, a line-number counter is 
incremented; a line-number address matches (selects) the input line which causes the internal 
counter to equal the address line-number. The counter runs cumulatively through multiple input 
files; it is not reset when a new input file is opened. 

As a special case, the character $ matches the last line of the last input file. 

2.2. Context Addresses 

A context address is a pattern ('regular expression') enclosed in slashes ('/'). The regular expressions 
recognized by sed are constructed as follows: 

1) An ordinary character (not one of those discussed below) is a regular expression, and 
matches that character. 

2) A circumflex '^' at the beginning of a regular expression matches the null character at 
the beginning of a line. 

3) A dollar-sign '$' at the end of a regular expression matches the null character at the end 
of a line. 

4) The characters '\n' match an imbedded newline character, but not the newline at the 
end of the pattern space. 

5) A period '.' matches any character except the terminal newline of the pattern space. 

6) A regular expression followed by an asterisk '*' matches any number (including 0) of 
adjacent occurrences of the regular expression it follows. 

7) A string of characters in square brackets '[ ]' matches any character in the string, and no 
others. If, however, the first character of the string is circumflex '^', the regular 
expression matches any character except the characters in the string and the terminal 
newline of the pattern space. 

8) A concatenation of regular expressions is a regular expression which matches the concatenation 
of strings matched by the components of the regular expression. 

9) A regular expression between the sequences '\(' and '\)' is identical in effect to the unadorned 
regular expression, but has side-effects which are described under the s 
command below and specification 10) immediately below. 

10) The expression '\d' means the same string of characters matched by an expression 
enclosed in '\(' and '\)' earlier in the same pattern. Here d is a single digit; the 
string specified is that beginning with the dth occurrence of '\(' counting from the 
left. For example, the expression '^\(.*\)\1' matches a line beginning with two 
repeated occurrences of the same string. 

11) The null regular expression standing alone (e.g., '//') is equivalent to the last regular 
expression compiled. 

To use one of the special characters (^ $ . * [ ] \ /) as a literal (to match an occurrence of itself in 
the input), precede the special character by a backslash '\'. 

For a context address to 'match' the input requires that the whole pattern within the address match 
some portion of the pattern space. 

2.3. Number of Addresses 

The commands in the next section can have 0, 1, or 2 addresses. Under each command the max- 
imum number of allowed addresses is given. For a command to have more addresses than the 
maximum allowed is considered an error. 

If a command has no addresses, it is applied to every line in the input. 

If a command has one address, it is applied to all lines which match that address. 

If a command has two addresses, it is applied to the first line which matches the first address, and 
to all subsequent lines until (and including) the first subsequent line which matches the second 
address. Then an attempt is made on subsequent lines to again match the first address, and the 
process is repeated. 

Two addresses are separated by a comma. 
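
The two-address rule can be followed line by line with one bit of state. The C sketch below is an illustration only and is not taken from the sed source: each address is treated as a plain substring rather than a regular expression, and the names range and range_selects are invented here.

```c
#include <string.h>

/* A command with two context addresses, plus the state needed to
   remember whether a range is currently open. */
struct range {
    const char *first;    /* text that opens the range */
    const char *second;   /* text that closes the range */
    int active;           /* nonzero while inside a range */
};

/* Returns 1 if the command is applied to this line, 0 otherwise.
   Lines must be presented in input order. */
int range_selects(struct range *r, const char *line)
{
    if (!r->active) {
        if (strstr(line, r->first) == NULL)
            return 0;
        r->active = 1;     /* first address matched: the range opens */
        return 1;
    }
    if (strstr(line, r->second) != NULL)
        r->active = 0;     /* closing line is included, then the range ends */
    return 1;
}
```

With the sample text and the substrings stately and caverns, lines 2, 3 and 4 are selected; after line 4 the search for the first address starts again.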



Examples: 

/an/           matches lines 1, 3, 4 in our sample text 
/an.*an/       matches line 1 
/^an/          matches no lines 
/./            matches all lines 
/\./           matches line 5 
/r*an/         matches lines 1, 3, 4 (number = zero!) 
/\(an\).*\1/   matches line 1 



3. FUNCTIONS 

All functions are named by a single character. In the following summary, the maximum number of 
allowable addresses is given enclosed in parentheses, then the single character function name, possible 
arguments enclosed in angles (< >), an expanded English translation of the single-character 
name, and finally a description of what each function does. The angles around the arguments are 
not part of the argument, and should not be typed in actual editing commands. 

3.1. Whole-line Oriented Functions 

(2)d -- delete lines 

The d function deletes from the file (does not write to the output) all those lines 
matched by its address(es). 

It also has the side effect that no further commands are attempted on the corpse of 
a deleted line; as soon as the d function is executed, a new line is read from the 
input, and the list of editing commands is re-started from the beginning on the 
new line. 

(2)n -- next line 

The n function reads the next line from the input, replacing the current line. The 
current line is written to the output if it should be. The list of editing commands 
is continued following the n command. 

(1)a\ 
<text> -- append lines 

The a function causes the argument <text> to be written to the output after the line 
matched by its address. The a command is inherently multi-line; a must appear at 
the end of a line, and <text> may contain any number of lines. To preserve the 
one-command-to-a-line fiction, the interior newlines must be hidden by a backslash 
character ('\') immediately preceding the newline. The <text> argument is terminated 
by the first unhidden newline (the first one not immediately preceded by 
backslash). 

Once an a function is successfully executed, <text> will be written to the output 
regardless of what later commands do to the line which triggered it. The triggering 
line may be deleted entirely; <text> will still be written to the output. 

The <text> is not scanned for address matches, and no editing commands are 
attempted on it. It does not cause any change in the line-number counter. 

(1)i\
<text> -- insert lines

The i function behaves identically to the a function, except that <text> is written
to the output before the matched line. All other comments about the a function
apply to the i function as well.

(2)c\
<text> -- change lines

The c function deletes the lines selected by its address(es), and replaces them with
the lines in <text>. Like a and i, c must be followed by a newline hidden by a
backslash; and interior newlines in <text> must be hidden by backslashes.

The c command may have two addresses, and therefore select a range of lines. If
it does, all the lines in the range are deleted, but only one copy of <text> is written
to the output, not one copy per line deleted. As with a and i, <text> is not scanned
for address matches, and no editing commands are attempted on it. It does not
change the line-number counter.

After a line has been deleted by a c function, no further commands are attempted 
on the corpse. 

If text is appended after a line by a or r functions, and the line is subsequently 
changed, the text inserted by the c function will be placed before the text of the a 
or r functions. (The r function is described in Section 3.3.)

Note: Within the text put in the output by these functions, leading blanks and tabs will disappear,
as always in sed commands. To get leading blanks and tabs into the output, precede the first
desired blank or tab by a backslash; the backslash will not appear in the output.

Example: 

The list of editing commands:

n
a\
XXXX
d

applied to our standard input, produces:

In Xanadu did Kubla Khan
XXXX
Where Alph, the sacred river, ran
XXXX
Down to a sunless sea.

In this particular case, the same effect would be produced by either of the two following command
lists:

n          n
i\         c\
XXXX       XXXX
d
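As a rough check of the n, a, d example, the command list can be passed to sed as a multi-line script. This is a sketch assuming a present-day POSIX sed; when n finds no further input on the last line, current implementations print that line and stop, which matches the output shown above.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# One command per line, exactly as in the command list above.
out=$(printf '%s\n' "$poem" | sed 'n
a\
XXXX
d')
printf '%s\n' "$out"
```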

3.2. Substitute Function 

One very important function changes parts of lines selected by a context search within the line. 

(2)s<pattern><replacement><flags> -- substitute

The s function replaces part of a line (selected by <pattern>) with <replacement>.
It can best be read:

Substitute for <pattern>, <replacement>

The <pattern> argument contains a pattern, exactly like the patterns in addresses
(see 2.2 above). The only difference between <pattern> and a context address is
that the context address must be delimited by slash ('/') characters; <pattern> may
be delimited by any character other than space or newline.

By default, only the first string matched by <pattern> is replaced, but see the g flag
below.

The <replacement> argument begins immediately after the second delimiting character
of <pattern>, and must be followed immediately by another instance of the
delimiting character. (Thus there are exactly three instances of the delimiting character.)

The <replacement> is not a pattern, and the characters which are special in patterns
do not have special meaning in <replacement>. Instead, other characters are
special:

&   is replaced by the string matched by <pattern>.

\d  (where d is a single digit) is replaced by the dth substring matched by
    parts of <pattern> enclosed in '\(' and '\)'. If nested substrings
    occur in <pattern>, the dth is determined by counting opening
    delimiters ('\(').

    As in patterns, special characters may be made literal by preceding
    them with backslash ('\').

The <flags> argument may contain the following flags:

g -- substitute <replacement> for all (non-overlapping) instances of <pattern>
     in the line. After a successful substitution, the scan for the
     next instance of <pattern> begins just after the end of the inserted
     characters; characters put into the line from <replacement> are not
     rescanned.

p -- print the line if a successful replacement was done. The p flag causes
     the line to be written to the output if and only if a substitution
     was actually made by the s function. Notice that if several s functions,
     each followed by a p flag, successfully substitute in the same
     input line, multiple copies of the line will be written to the output:
     one for each successful substitution.

w <filename> -- write the line to a file if a successful replacement was
     done. The w flag causes lines which are actually substituted by
     the s function to be written to a file named by <filename>. If
     <filename> exists before sed is run, it is overwritten; if not, it is
     created.

     A single space must separate w and <filename>.

     The possibilities of multiple, somewhat different copies of one
     input line being written are the same as for p.

     A maximum of 10 different file names may be mentioned after w
     flags and w functions (see below), combined.

Examples: 

The following command, applied to our standard input,

s/to/by/w changes

produces, on the standard output:

In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless by man
Down by a sunless sea.

and, on the file 'changes':

Through caverns measureless by man
Down by a sunless sea.

If the nocopy option is in effect, the command: 

s/[.,;?:]/*P&*/gp 

produces: 

A stately pleasure dome decree*P:*
Where Alph*P,* the sacred river*P,* ran 
Down to a sunless sea*P.* 

Finally, to illustrate the effect of the g flag, the command:

/X/s/an/AN/p

produces (assuming nocopy mode):

In XANadu did Kubla Khan

and the command:

/X/s/an/AN/gp

produces:

In XANadu did Kubla KhAN
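The substitute examples can be tried from the shell roughly as follows. The sketch assumes a present-day POSIX sed, with the -n option standing in for nocopy mode; the last command is an added illustration of the \d notation and does not come from the examples above. The variable names are arbitrary.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# & stands for the matched punctuation character; g repeats within a line.
marked=$(printf '%s\n' "$poem" | sed -n 's/[.,;?:]/*P&*/gp')
# Without g only the first "an" on the addressed line is replaced.
first=$(printf '%s\n' "$poem" | sed -n '/X/s/an/AN/p')
every=$(printf '%s\n' "$poem" | sed -n '/X/s/an/AN/gp')
# \1 and \2 refer to the first and second \( \) groups of the pattern.
swapped=$(printf '%s\n' "$poem" | sed -n 's/\(Kubla\) \(Khan\)/\2 \1/p')

printf '%s\n' "$marked" "$first" "$every" "$swapped"
```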

3.3. Input-output Functions 
(2)p -- print

The print function writes the addressed lines to the standard output file. They are
written at the time the p function is encountered, regardless of what succeeding
editing commands may do to the lines.

(2)w <filename> -- write on <filename>

The write function writes the addressed lines to the file named by <filename>. If
the file previously existed, it is overwritten; if not, it is created. The lines are written
exactly as they exist when the write function is encountered for each line,
regardless of what subsequent editing commands may do to them.

Exactly one space must separate the w and <filename>.

A maximum of ten different files may be mentioned in write functions and w flags
after s functions, combined.

(1)r <filename> -- read the contents of a file

The read function reads the contents of <filename>, and appends them after the
line matched by the address. The file is read and appended regardless of what
subsequent editing commands do to the line which matched its address. If r and a
functions are executed on the same line, the text from the a functions and the r
functions is written to the output in the order that the functions are executed.

Exactly one space must separate the r and <filename>. If a file mentioned by an r
function cannot be opened, it is considered a null file, not an error, and no diagnostic
is given.

NOTE: Since there is a limit to the number of files that can be opened simultaneously, care should
be taken that no more than ten files be mentioned in w functions or flags; that number is reduced
by one if any r functions are present. (Only one read file is open at one time.)

Examples 

Assume that the file 'note1' has the following contents:

Note: Kubla Khan (more properly Kublai Khan; 1216-1294) was the grandson
and most eminent successor of Genghiz (Chingiz) Khan, and founder of the
Mongol dynasty in China.

Then the following command:

/Kubla/r note1

produces:

In Xanadu did Kubla Khan
Note: Kubla Khan (more properly Kublai Khan; 1216-1294) was the grandson
and most eminent successor of Genghiz (Chingiz) Khan, and founder of the
Mongol dynasty in China.
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.

3.4. Multiple Input-line Functions 

Three functions, all spelled with capital letters, deal specially with pattern spaces containing imbedded
newlines; they are intended principally to provide pattern matches across lines in the input.

(2)N -- Next line

The next input line is appended to the current line in the pattern space; the two
input lines are separated by an imbedded newline. Pattern matches may extend
across the imbedded newline(s).

(2)D -- Delete first part of the pattern space

Delete up to and including the first newline character in the current pattern space.
If the pattern space becomes empty (the only newline was the terminal newline),
read another line from the input. In any case, begin the list of editing commands
again from its beginning.

(2)P -- Print first part of the pattern space

Print up to and including the first newline in the pattern space.

The P and D functions are equivalent to their lower-case counterparts if there are no imbedded
newlines in the pattern space.
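A small sketch of N in use, assuming a present-day POSIX sed: each odd-numbered line is joined to the line after it, and the substitution matches the imbedded newline. The address $!N ("not the last line") is an added precaution, since implementations differ in what N does when no input remains.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# Append the next line to the pattern space, then turn the
# imbedded newline into a blank.
joined=$(printf '%s\n' "$poem" | sed '$!N
s/\n/ /')
printf '%s\n' "$joined"
```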

3.5. Hold and Get Functions

Four functions save and retrieve part of the input for possible later use.

(2)h -- hold pattern space

The h function copies the contents of the pattern space into a hold area (destroying
the previous contents of the hold area).

(2)H -- Hold pattern space

The H function appends the contents of the pattern space to the contents of the
hold area; the former and new contents are separated by a newline.

(2)g -- get contents of hold area

The g function copies the contents of the hold area into the pattern space (destroying
the previous contents of the pattern space).

(2)G -- Get contents of hold area

The G function appends the contents of the hold area to the contents of the pattern
space; the former and new contents are separated by a newline.

(2)x -- exchange 

The exchange command interchanges the contents of the pattern space and the 
hold area. 

Example 

The commands

1h
1s/ did.*//
1x
G
s/\n/ :/

applied to our standard example, produce:

In Xanadu did Kubla Khan :In Xanadu 
A stately pleasure dome decree: :In Xanadu 
Where Alph, the sacred river, ran :In Xanadu 
Through caverns measureless to man :In Xanadu 
Down to a sunless sea. :In Xanadu 
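The hold-area example can be reproduced with a sketch like the following, assuming a present-day POSIX sed. After the first line, the hold area contains "In Xanadu", and G appends it to every line.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# Line 1: save it, trim it to "In Xanadu", then swap so that the
# trimmed copy ends up in the hold area. Every line: append the
# hold area and replace the joining newline with " :".
tagged=$(printf '%s\n' "$poem" | sed '1h
1s/ did.*//
1x
G
s/\n/ :/')
printf '%s\n' "$tagged"
```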

3.6. Flow-of-Control Functions 

These functions do no editing on the input lines, but control the application of functions to the 
lines selected by the address part. 

(2)! -- Don't

The Don't command causes the next command (written on the same line), to be
applied to all and only those input lines not selected by the address part.

(2){ -- Grouping

The grouping command causes the next set of commands to be applied (or not
applied) as a block to the input lines selected by the addresses of the grouping
command. The first of the commands under control of the grouping may appear
on the same line as the '{' or on the next line.

The group of commands is terminated by a matching '}' standing on a line by
itself.

Groups can be nested.

(0):<label> -- place a label

The label function marks a place in the list of editing commands which may be
referred to by b and t functions. The <label> may be any sequence of eight or
fewer characters; if two different colon functions have identical labels, a compile
time diagnostic will be generated, and no execution attempted.

(2)b<label> -- branch to label

The branch function causes the sequence of editing commands being applied to
the current input line to be restarted immediately after the place where a colon
function with the same <label> was encountered. If no colon function with the
same label can be found after all the editing commands have been compiled, a
compile time diagnostic is produced, and no execution is attempted.

A b function with no <label> is taken to be a branch to the end of the list of editing
commands; whatever should be done with the current input line is done, and
another input line is read; the list of editing commands is restarted from the beginning
on the new line.

(2)t<label> -- test substitutions

The t function tests whether any successful substitutions have been made on the
current input line; if so, it branches to <label>; if not, it does nothing. The flag
which indicates that a successful substitution has been executed is reset by:

1) reading a new input line, or

2) executing a t function.
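The three sketches below illustrate the Don't command, grouping, and a label with the t function. They assume a present-day POSIX sed; the label name "again" and the input to the last command are hypothetical choices for illustration.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# Don't: print only the lines that do NOT contain "an".
rest=$(printf '%s\n' "$poem" | sed -n '/an/!p')

# Grouping: apply two commands to the range of lines 3 through 4.
grouped=$(printf '%s\n' "$poem" | sed -n '/Alph/,/man/{
s/an/AN/
p
}')

# Label and test: repeat the substitution until it fails,
# squeezing every run of blanks down to a single blank.
squeezed=$(printf 'a    b  c\n' | sed ':again
s/  / /
t again')

printf '%s\n' "$rest" "$grouped" "$squeezed"
```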

3.7. Miscellaneous Functions 
(1)= -- equals

The = function writes to the standard output the line number of the line matched
by its address.

(1)q -- quit

The q function causes the current line to be written to the output (if it should be), 
any appended or read text to be written, and execution to be terminated. 
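A brief sketch of both functions, assuming a present-day POSIX sed: = reports where a pattern occurs, and q stops after the second line.

```shell
#!/bin/sh
poem='In Xanadu did Kubla Khan
A stately pleasure dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.'

# = prints the number of the line that matches /sacred/.
where=$(printf '%s\n' "$poem" | sed -n '/sacred/=')
# q writes the current line (line 2 here) and then terminates.
head2=$(printf '%s\n' "$poem" | sed '2q')

printf '%s\n' "$where" "$head2"
```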

Reference 

[1] Ken Thompson and Dennis M. Ritchie, The UNIX Programmer's Manual. Bell Laboratories, 
1978.

ABSTRACT

Awk is a programming language whose basic operation is to search a set 
of files for patterns, and to perform specified actions upon lines or fields of 
lines which contain instances of those patterns. Awk makes certain data selec- 
tion and transformation operations easy to express; for example, the awk pro- 
gram 

length > 72 

prints all input lines whose length exceeds 72 characters; the program 

NF % 2 == 0

prints all lines with an even number of fields; and the program 

{ $1 = log($1); print }

replaces the first field of each line by its logarithm. 

Awk patterns may include arbitrary boolean combinations of regular 
expressions and of relational operators on strings, numbers, fields, variables, 
and array elements. Actions may include the same pattern-matching constructions
as in patterns, as well as arithmetic and string expressions and assignments,
if-else, while, for statements, and multiple output streams.

This report contains a user's guide, a discussion of the design and implementation
of awk, and some timing statistics.

1. Introduction

Awk is a programming language designed 
to make many common information retrieval and 
text manipulation tasks easy to state and to per- 
form. 

The basic operation of awk is to scan a set 
of input lines in order, searching for lines which 
match any of a set of patterns which the user has 
specified. For each pattern, an action can be 
specified; this action will be performed on each 
line that matches the pattern. 

Readers familiar with the UNIX program
grep will recognize the approach, although in
awk the patterns may be more general than in 
grep, and the actions allowed are more involved 
than merely printing the matching line. For 
example, the awk program 

{print $3, $2} 

prints the third and second columns of a table in 
that order. The program 

$2 ~ /A|B|C/

prints all input lines with an A, B, or C in the 
second field. The program 

$1 != prev { print; prev = $1 }

prints all lines in which the first field is different 
from the previous first field. 
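Two of these programs can be tried on a small table, as sketched below. The sketch assumes a present-day POSIX awk; the three-line table and the variable names are hypothetical.

```shell
#!/bin/sh
table='a 1 x
a 2 y
b 3 z'

# Third and second columns, in that order.
cols=$(printf '%s\n' "$table" | awk '{ print $3, $2 }')
# Only lines whose first field differs from the previous first field.
changes=$(printf '%s\n' "$table" | awk '$1 != prev { print; prev = $1 }')

printf '%s\n' "$cols" "$changes"
```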

1.1. Usage 

The command 

awk program [files] 

executes the awk commands in the string pro- 
gram on the set of named files, or on the stan- 
dard input if there are no files. The statements 
can also be placed in a file pfile, and executed by 
the command

awk -f pfile [files]

1.2. Program Structure 

An awk program is a sequence of statements
of the form

pattern { action }
pattern { action }

Each line of input is matched against each of the 
patterns in turn. For each pattern that matches, 
the associated action is executed. When all the 
patterns have been tested, the next line is 
fetched and the matching starts over. 

Either the pattern or the action may be left 
out, but not both. If there is no action for a pat- 
tern, the matching line is simply copied to the 
output. (Thus a line which matches several pat- 
terns can be printed several times.) If there is no 
pattern for an action, then the action is per- 
formed for every input line. A line which 
matches no pattern is ignored. 

Since patterns and actions are both 
optional, actions must be enclosed in braces to 
distinguish them from patterns. 

1.3. Records and Fields 

Awk input is divided into "records" ter- 
minated by a record separator. The default 
record separator is a newline, so by default awk 
processes its input a line at a time. The number 
of the current record is available in a variable 
named NR. 

Each input record is considered to be 
divided into "fields." Fields are normally 
separated by white space — blanks or tabs — but 
the input field separator may be changed, as 
described below. Fields are referred to as $1, 
$2, and so forth, where $1 is the first field, and 
$0 is the whole input record itself. Fields may 



be assigned to. The number of fields in the 
current record is available in a variable named 
NF. 

The variables FS and RS refer to the input 
field and record separators; they may be changed 
at any time to any single character. The optional 
command-line argument -Fc may also be used
to set FS to the character c. 

If the record separator is empty, an empty 
input line is taken as the record separator, and 
blanks, tabs and newlines are treated as field 
separators. 

The variable FILENAME contains the 
name of the current input file. 

1.4. Printing 

An action may have no pattern, in which 
case the action is executed for all lines. The 
simplest action is to print some or all of a record; 
this is accomplished by the awk command print. 
The awk program 

{ print }

prints each record, thus copying the input to the 
output intact. More useful is to print a field or 
fields from each record. For instance, 

print $2, $1 

prints the first two fields in reverse order. Items 
separated by a comma in the print statement will 
be separated by the current output field separator 
when output. Items not separated by commas 
will be concatenated, so 

print $1 $2 

runs the first and second fields together. 

The predefined variables NF and NR can 
be used; for example 

{ print NR, NF, $0 } 

prints each record preceded by the record 
number and the number of fields. 
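The print forms described so far behave roughly as follows. This is a sketch assuming a present-day POSIX awk with the default output field separator (a single blank); the input lines are arbitrary.

```shell
#!/bin/sh
# Record number, field count, then the record itself.
numbered=$(printf 'In Xanadu did Kubla Khan\nDown to a sunless sea.\n' |
    awk '{ print NR, NF, $0 }')
# Comma-separated items are joined by the output field separator.
swapped=$(printf 'one two three\n' | awk '{ print $2, $1 }')
# Items with no comma between them are concatenated.
glued=$(printf 'one two three\n' | awk '{ print $1 $2 }')

printf '%s\n' "$numbered" "$swapped" "$glued"
```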

Output may be diverted to multiple files; 
the program 

{ print $1 >"foo1"; print $2 >"foo2" }

writes the first field, $1, on the file foo1, and
the second field on file foo2. The >> notation
can also be used:

print $1 >>"foo"

appends the output to the file foo. (In each
case, the output files are created if necessary.) 
The file name can be a variable or a field as well 
as a constant; for example, 

print $1 >$2 



uses the contents of field 2 as a file name. 

Naturally there is a limit on the number of 
output files; currently it is 10. 

Similarly, output can be piped into another 
process (on UNIX only); for instance, 

print | "mail bwk"

mails the output to bwk. 

The variables OFS and ORS may be used 
to change the current output field separator and 
output record separator. The output record 
separator is appended to the output of the print 
statement. 

Awk also provides the printf statement for 
output formatting: 

printf format expr, expr, ... 

formats the expressions in the list according to 
the specification in format and prints them. For 
example, 

printf "%8.2f %10ld\n", $1, $2 

prints $1 as a floating point number 8 digits 
wide, with two after the decimal point, and $2 as 
a 10-digit long decimal number, followed by a 
newline. No output separators are produced 
automatically; you must add them yourself, as in 
this example. The version of printf is identical
to that used with C [2].

2. Patterns 

A pattern in front of an action acts as a 
selector that determines whether the action is to 
be executed. A variety of expressions may be 
used as patterns: regular expressions, arithmetic 
relational expressions, string-valued expressions, 
and arbitrary boolean combinations of these. 

2.1. BEGIN and END 

The special pattern BEGIN matches the 
beginning of the input, before the first record is 
read. The pattern END matches the end of the 
input, after the last record has been processed. 
BEGIN and END thus provide a way to gain con- 
trol before and after processing, for initialization 
and wrapup. 

As an example, the field separator can be 
set to a colon by 

BEGIN { FS = ":" }
... rest of program ... 

Or the input lines may be counted by 

END { print NR } 

If BEGIN is present, it must be the first pattern; 
END must be the last if used. 
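A minimal sketch combining both special patterns, assuming a present-day POSIX awk; the colon-separated input is made up for illustration.

```shell
#!/bin/sh
# BEGIN sets the field separator before any input is read;
# END prints the number of records once the input is exhausted.
out=$(printf 'root:0:/\nbwk:9:/usr/bwk\n' |
    awk 'BEGIN { FS = ":" } { print $1 } END { print NR }')
printf '%s\n' "$out"
```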



2.2. Regular Expressions 

The simplest regular expression is a literal 
string of characters enclosed in slashes, like 

/smith/ 

This is actually a complete awk program which 
will print all lines which contain any occurrence 
of the name "smith". If a line contains "smith" 
as part of a larger word, it will also be printed, as 
in 

blacksmithing 

Awk regular expressions include the regular
expression forms found in the UNIX text editor
ed [1] and grep (without back-referencing). In
addition, awk allows parentheses for grouping, |
for alternatives, + for "one or more", and ? for
"zero or one", all as in lex. Character classes
may be abbreviated: [a-zA-Z0-9] is the set
of all letters and digits. As an example, the awk
program

/[Aa]ho|[Ww]einberger|[Kk]ernighan/

will print all lines which contain any of the 
names "Aho," "Weinberger" or "Kernighan," 
whether capitalized or not. 

Regular expressions (with the extensions 
listed above) must be enclosed in slashes, just as 
in ed and sed. Within a regular expression, 
blanks and the regular expression metacharacters 
are significant. To turn off the magic meaning of
one of the regular expression characters, precede 
it with a backslash. An example is the pattern 

/\/.*\//

which matches any string of characters enclosed 
in slashes. 

One can also specify that any field or variable
matches a regular expression (or does not
match it) with the operators ~ and !~. The
program

$1 ~ /[jJ]ohn/

prints all lines where the first field matches
"john" or "John." Notice that this will also
match "Johnson", "St. Johnsbury", and so on.
To restrict it to exactly [jJ]ohn, use

$1 ~ /^[jJ]ohn$/

The caret ^ refers to the beginning of a line or
field; the dollar sign $ refers to the end.
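The difference between the unanchored and anchored patterns shows up on a short list of names. The sketch assumes a present-day POSIX awk, and the names are invented for illustration.

```shell
#!/bin/sh
names='John Smith
Johnson Lyndon
john doe
Mary John'

# Unanchored: also selects "Johnson".
loose=$(printf '%s\n' "$names" | awk '$1 ~ /[jJ]ohn/')
# Anchored at both ends: the first field must be exactly John or john.
exact=$(printf '%s\n' "$names" | awk '$1 ~ /^[jJ]ohn$/')
# The !~ operator selects records whose first field does not match.
others=$(printf '%s\n' "$names" | awk '$1 !~ /[jJ]ohn/')

printf '%s\n' "$loose" "$exact" "$others"
```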

2.3. Relational Expressions 

An awk pattern can be a relational expression
involving the usual relational operators <,
<=, ==, !=, >=, and >. An example is

$2 > $1 + 100

which selects lines where the second field is
more than 100 greater than the first field. Similarly,

NF % 2 == 0

prints lines with an even number of fields. 

In relational tests, if neither operand is 
numeric, a string comparison is made; otherwise 
it is numeric. Thus, 

$1 >= "s" 

selects lines that begin with an s, t, u, etc. In 
the absence of any other information, fields are 
treated as strings, so the program 

$1 > $2 

will perform a string comparison. 

2.4. Combinations of Patterns 

A pattern can be any boolean combination
of patterns, using the operators || (or), &&
(and), and ! (not). For example,

$1 >= "s" && $1 < "t" && $1 != "smith"

selects lines where the first field begins with "s",
but is not "smith". && and || guarantee that
their operands will be evaluated from left to
right; evaluation stops as soon as the truth or
falsehood is determined.

2.5. Pattern Ranges 

The "pattern" that selects an action may 
also consist of two patterns separated by a 
comma, as in 

pat1, pat2 { ... }

In this case, the action is performed for each line
between an occurrence of pat1 and the next
occurrence of pat2 (inclusive). For example,

/start/, /stop/

prints all lines between start and stop, while

NR == 100, NR == 200 { ... }

does the action for lines 100 through 200 of the 
input. 
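Both kinds of range can be seen on a five-line input, as in this sketch. It assumes a present-day POSIX awk; the input and the shortened line-number range (2 through 3 rather than 100 through 200) are illustrative.

```shell
#!/bin/sh
input='a
start
b
stop
c'

# Regular-expression range, inclusive at both ends.
between=$(printf '%s\n' "$input" | awk '/start/, /stop/')
# Record-number range with an explicit action.
byline=$(printf '%s\n' "$input" | awk 'NR == 2, NR == 3 { print NR ": " $0 }')

printf '%s\n' "$between" "$byline"
```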

3. Actions 

An awk action is a sequence of action 
statements terminated by newlines or semi- 
colons. These action statements can be used to 
do a variety of bookkeeping and string manipu- 
lating tasks. 



3.1. Built-in Functions 

Awk provides a "length" function to com- 
pute the length of a string of characters. This 
program prints each record, preceded by its 
length: 

{print length, $0} 

length by itself is a "pseudo-variable" which 
yields the length of the current record; 
length (argument) is a function which yields the 
length of its argument, as in the equivalent 

{print length($0), $0} 

The argument may be any expression. 

Awk also provides the arithmetic functions 
sqrt, log, exp, and int, for square root, base e 
logarithm, exponential, and integer part of their 
respective arguments. 

The name of one of these built-in func- 
tions, without argument or parentheses, stands 
for the value of the function on the whole 
record. The program 

length < 10 || length > 20

prints lines whose length is less than 10 or 
greater than 20. 

The function substr(s, m, n) produces the
substring of s that begins at position m (origin
1) and is at most n characters long. If n is omitted,
the substring goes to the end of s. The
function index(s1, s2) returns the position
where the string s2 occurs in s1, or zero if it
does not.

The function sprintf(f, e1, e2, ...) produces
the value of the expressions e1, e2, etc., in the
printf format specified by f. Thus, for example,

x = sprintf("%8.2f %10ld", $1, $2)

sets x to the string produced by formatting the 
values of $1 and $2. 
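The string functions can be exercised in a BEGIN action, which needs no input. This is a sketch assuming a present-day POSIX awk; the test string and the brackets around the sprintf result (added to make the padding visible) are illustrative.

```shell
#!/bin/sh
out=$(awk 'BEGIN {
    s = "blacksmithing"
    print length(s)                            # number of characters
    print index(s, "smith")                    # position of first occurrence
    print substr(s, 6, 5)                      # five characters from position 6
    print substr(s, 11)                        # n omitted: to the end of s
    print "[" sprintf("%8.2f", 3.14159) "]"    # padded to width 8
}')
printf '%s\n' "$out"
```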

3.2. Variables, Expressions, and Assign- 
ments 

Awk variables take on numeric (floating 
point) or string values according to context. For 
example, in 

x = 1 

x is clearly a number, while in 

x = "smith"

it is clearly a string. Strings are converted to 
numbers and vice versa whenever context 
demands it. For instance, 

x = "3" + "4"

assigns 7 to x. Strings which cannot be inter- 



preted as numbers in a numerical context will 
generally have numeric value zero, but it is 
unwise to count on this behavior. 

By default, variables (other than built-ins) 
are initialized to the null string, which has 
numerical value zero; this eliminates the need 
for most BEGIN sections. For example, the 
sums of the first two fields can be computed by 

{ s1 += $1; s2 += $2 }
END { print s1, s2 } 
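Run on three lines of numbers, the summing program behaves as sketched here, assuming a present-day POSIX awk. The second command checks the string-to-number conversion mentioned earlier; the input values are arbitrary.

```shell
#!/bin/sh
# s1 and s2 start out as the null string, which counts as zero.
totals=$(printf '1 10\n2 20\n3 30\n' |
    awk '{ s1 += $1; s2 += $2 } END { print s1, s2 }')
# Strings are converted to numbers when the context is arithmetic.
seven=$(awk 'BEGIN { x = "3" + "4"; print x }')

printf '%s\n' "$totals" "$seven"
```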

Arithmetic is done internally in floating
point. The arithmetic operators are +, -, *, /,
and % (mod). The C increment ++ and decrement
-- operators are also available, and so
are the assignment operators +=, -=, *=,
/=, and %=. These operators may all be used
in expressions.

3.3. Field Variables 

Fields in awk share essentially all of the 
properties of variables — they may be used in 
arithmetic or string operations, and may be 
assigned to. Thus one can replace the first field 
with a sequence number like this: 

{ $1 = NR; print }

or accumulate two fields into a third, like this:

{ $1 = $2 + $3; print $0 }

or assign a string to a field:

{ if ($3 > 1000)
        $3 = "too big"
  print
}

which replaces the third field by "too big" when 
it is, and in any case prints the record. 

Field references may be numerical expres- 
sions, as in 

{ print $i, $(i+1), $(i+n) }

Whether a field is deemed numeric or string 
depends on context; in ambiguous cases like 

if ($1 == $2) ...

fields are treated as strings. 

Each input line is split into fields automati- 
cally as necessary. It is also possible to split any 
variable or string into fields: 

n = split(s, array, sep) 

splits the string s into array[1], ..., array[n].
The number of elements found is returned. If 
the sep argument is provided, it is used as the 
field separator; otherwise FS is used as the 
separator. 
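A short sketch of field assignment and of split, assuming a present-day POSIX awk; the array name piece and the inputs are hypothetical.

```shell
#!/bin/sh
# Assigning to $1 replaces the first field; print then writes the
# rebuilt record.
renum=$(printf 'x y\nz w\n' | awk '{ $1 = NR; print }')
# split returns the number of pieces and fills piece[1] .. piece[n].
parts=$(awk 'BEGIN { n = split("a:b:c", piece, ":"); print n, piece[1], piece[n] }')

printf '%s\n' "$renum" "$parts"
```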



3.4. String Concatenation 

Strings may be concatenated. For example 

length($1 $2 $3) 

returns the length of the first three fields. Or in 
a print statement, 

print $1 " is " $2

prints the two fields separated by " is ". Variables
and numeric expressions may also appear
in concatenations.

3.5. Arrays 

Array elements are not declared; they 
spring into existence by being mentioned. Sub- 
scripts may have any non-null value, including 
non-numeric strings. As an example of a con- 
ventional numeric subscript, the statement 

x[NR] = $0

assigns the current input record to the NR-th element
of the array x. In fact, it is possible in
principle (though perhaps slow) to process the
entire input in a random order with the awk program

{ x[NR] = $0 }
END { ... program ... }

The first action merely records each input line in 
the array x. 

Array elements may be named by non- 
numeric values, which gives awk a capability 
rather like the associative memory of Snobol 
tables. Suppose the input contains fields with 
values like apple, orange, etc. Then the pro- 
gram 

/apple/ { x["apple"]++ }
/orange/ { x["orange"]++ }

END { print x["apple"], x["orange"] } 

increments counts for the named array elements, 
and prints them at the end of the input. 
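The counting program can be run on a few made-up lines, as in this sketch (assuming a present-day POSIX awk):

```shell
#!/bin/sh
# Each pattern increments its own string-subscripted element;
# END prints both counts.
counts=$(printf 'apple pie\norange juice\napple tart\n' |
    awk '/apple/  { x["apple"]++ }
         /orange/ { x["orange"]++ }
         END      { print x["apple"], x["orange"] }')
printf '%s\n' "$counts"
```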

3.6. Flow-of-Control Statements 

Awk provides the basic flow-of-control 
statements if-else, while, for, and statement 
grouping with braces, as in C. We showed the if 
statement in section 3.3 without describing it. 
The condition in parentheses is evaluated; if it is 
true, the statement following the if is done. The 
else part is optional. 

The while statement is exactly like that of
C. For example, to print all input fields one per
line,

i = 1
while (i <= NF) {
        print $i
        ++i
}

The for statement is also exactly that of C:

for (i = 1; i <= NF; i++)
        print $i

does the same job as the while statement above. 

There is an alternate form of the for state- 
ment which is suited for accessing the elements 
of an associative array: 

for (i in array) 

statement 

does statement with i set in turn to each element 
of array. The elements are accessed in an 
apparently random order. Chaos will ensue if i is 
altered, or if any new elements are accessed dur- 
ing the loop. 
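The loops can be compared side by side in a sketch like this one, assuming a present-day POSIX awk. Because the "for (i in array)" form visits subscripts in no particular order, the last command pipes its output through sort to get a stable listing; the array name n and the inputs are hypothetical.

```shell
#!/bin/sh
# while and for versions of "print every field on its own line".
w=$(printf 'a b c\n' | awk '{ i = 1; while (i <= NF) { print $i; ++i } }')
f=$(printf 'a b c\n' | awk '{ for (i = 1; i <= NF; i++) print $i }')

# Count occurrences of each first field, then walk the array.
tally=$(printf 'apple\norange\napple\n' |
    awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' | sort)

printf '%s\n' "$w" "$tally"
```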

The expression in the condition part of an
if, while or for can include relational operators
like <, <=, >, >=, == ("is equal to"), and
!= ("not equal to"); regular expression matches
with the match operators ~ and !~; the logical
operators ||, &&, and !; and of course
parentheses for grouping.

The break statement causes an immediate 
exit from an enclosing while or for; the con- 
tinue statement causes the next iteration to 
begin. 

The statement next causes awk to skip 
immediately to the next record and begin scan- 
ning the patterns from the top. The statement 
exit causes the program to behave as if the end 
of the input had occurred. 

Comments may be placed in awk pro- 
grams: they begin with the character # and end 
with the end of the line, as in 

print x, y    # this is a comment

4. Design

The UNIX system already provides several
programs that operate by passing input through a 
selection mechanism. Grep, the first and sim- 
plest, merely prints all lines which match a single 
specified pattern. Egrep provides more general 
patterns, i.e., regular expressions in full generality;
fgrep searches for a set of keywords with a
particularly fast algorithm. Sed [1] provides most
of the editing facilities of the editor ed, applied
to a stream of input. None of these programs
provides numeric capabilities, logical relations, or 
variables. 



Lex [3] provides general regular expression
recognition capabilities, and, by serving as a C 
program generator, is essentially open-ended in 
its capabilities. The use of lex, however, 
requires a knowledge of C programming, and a 
lex program must be compiled and loaded before 
use, which discourages its use for one-shot appli- 
cations. 

Awk is an attempt to fill in another part of 
the matrix of possibilities. It provides general 
regular expression capabilities and an implicit 
input/output loop. But it also provides con- 
venient numeric processing, variables, more gen- 
eral selection, and control flow in the actions. It 
does not require compilation or a knowledge of 
C. Finally, awk provides a convenient way to 
access fields within lines; it is unique in this 
respect. 

Awk also tries to integrate strings and 
numbers completely, by treating all quantities as 
both string and numeric, deciding which 
representation is appropriate as late as possible. 
In most cases the user can simply ignore the 
differences. 
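
A small sketch of this dual treatment, assuming a POSIX awk: the same two fields are used once as numbers and once as strings. The input record and the shell variable out are made up for the illustration.

```shell
# The fields "007" and "3" serve as numbers or as strings,
# depending only on the operation applied to them.
out=$(printf '%s\n' '007 3' | awk '{
    print $1 + $2            # arithmetic uses the numeric values: 10
    print $1 $2              # concatenation uses the strings: 0073
    print length($1 + $2)    # the sum is then measured as a string: 2
}')

printf '%s\n' "$out"
```

No declaration or conversion appears anywhere; the operator alone selects the representation.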

Most of the effort in developing awk went 
into deciding what awk should or should not do 
(for instance, it doesn't do string substitution) 
and what the syntax should be (no explicit 
operator for concatenation) rather than into writing
or debugging the code. We have tried to
make the syntax powerful but easy to use and 
well adapted to scanning files. For example, the 
absence of declarations and implicit initializa- 
tions, while probably a bad idea for a general- 
purpose programming language, is desirable in a 
language that is meant to be used for tiny pro- 
grams that may even be composed on the com- 
mand line. 

In practice, awk usage seems to fall into 
two broad categories. One is what might be 
called "report generation" — processing an input 
to extract counts, sums, sub-totals, etc. This 
also includes the writing of trivial data validation 
programs, such as verifying that a field contains 
only numeric information or that certain delim- 
iters are properly balanced. The combination of 
textual and numeric processing is invaluable 
here. 
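
A report generator of this kind can be sketched in a few lines. The inventory records, the variable names total and n, and the message wording are all hypothetical; the program checks that the second field is purely numeric and sums the records that pass.

```shell
# Validate the second field, then report a count and a total.
# The three sample records are invented for this illustration.
report=$(printf '%s\n' 'pens 12' 'ink x9' 'paper 30' |
    awk '
        $2 !~ /^[0-9]+$/ { print "line " NR ": bad count: " $2; next }
                         { total += $2; n++ }
        END              { print n " valid, total " total }
    ')

printf '%s\n' "$report"
```

The second record is reported as bad and skipped, so the summary covers two records with a total of 42.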

A second area of use is as a data 
transformer, converting data from the form pro- 
duced by one program into that expected by 
another. The simplest examples merely select 
fields, perhaps with rearrangements. 
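
As a sketch of such a transformation, the following fragment turns one line of a long directory listing into a "name owner size" triple. The file name notes and the shell variable out are assumptions made for the illustration.

```shell
# Select and reorder fields: name, owner, size.
# The listing line is sample data with a made-up file name.
out=$(printf '%s\n' '-rw-rw-rw- 1 ava 123 Oct 15 17:05 notes' |
    awk '{ print $8, $3, $4 }')

printf '%s\n' "$out"
```

This prints "notes ava 123", a form that a later program in a pipeline could read directly.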



5. Implementation 

The actual implementation of awk uses the 
language development tools available on the 
UNIX operating system. The grammar is 
specified with yacc;⁴ the lexical analysis is done
by lex; the regular expression recognizers are
deterministic finite automata constructed directly 
from the expressions. An awk program is 
translated into a parse tree which is then directly 
executed by a simple interpreter. 

Awk was designed for ease of use rather 
than processing speed; the delayed evaluation of 
variable types and the necessity to break input 
into fields make high speed difficult to achieve
in any case. Nonetheless, the program has not 
proven to be unworkably slow. 

Table I below shows the execution (user
+ system) time on a PDP-11/70 of the UNIX
programs wc, grep, egrep, fgrep, sed, lex, and
awk on the following simple tasks:

1. count the number of lines. 

2. print all lines containing "doug". 

3. print all lines containing "doug", "ken" 
or "dmr". 

4. print the third field of each line. 

5. print the third and second fields of each 
line, in that order. 

6. append all lines containing "doug", 
"ken", and "dmr" to files "jdoug", 
"jken", and "jdmr", respectively. 

7. print each line prefixed by
"line-number : ".

8. sum the fourth column of a table. 

The program wc merely counts words, lines and 
characters in its input; we have already men- 
tioned the others. In all cases the input was a 
file containing 10,000 lines as created by the 
command ls -l; each line has the form

-rw-rw-rw- 1 ava 123 Oct 15 17:05 

The total length of this input is 452,960 charac- 
ters. Times for lex do not include compile or 
load. 

As might be expected, awk is not as fast 
as the specialized tools wc, sed, or the programs 
in the grep family, but is faster than the more 
general tool lex. In all cases, the tasks were 
about as easy to express as awk programs as pro- 
grams in these other languages; tasks involving 
fields were considerably easier to express as awk 
programs. Some of the test programs are shown 
in awk, sed and lex. 
Table I. Execution Times of Programs. (Times are in sec.)



The programs for some of these jobs are 
shown below. The lex programs are generally 
too long to show. 

AWK: 

1. END {print NR}

2. /doug/

3. /ken|doug|dmr/

4. {print $3}

5. {print $3, $2}

6. /ken/ {print >"jken"}
   /doug/ {print >"jdoug"}
   /dmr/ {print >"jdmr"}

7. {print NR ": " $0}

8. {sum = sum + $4}
   END {print sum}

SED: 

1. $=

2. /doug/p

3. /doug/p
   /doug/d
   /ken/p
   /ken/d
   /dmr/p
   /dmr/d

4. /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p

5. /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p

6. /ken/w jken
   /doug/w jdoug
   /dmr/w jdmr



LEX: 
1. %{
   int i;
   %}
   %%
   \n      i++;
   .       ;
   %%
   yywrap() {
           printf("%d\n", i);
   }

2. %%
   ^.*doug.*$      printf("%s\n", yytext);
   .       ;
   \n      ;



